Abstract

While most useful information theoretic inequalities can be deduced from the basic properties of entropy or mutual 
information, up to now Shannon's entropy power inequality (EPI) is an exception: Existing information theoretic proofs 
of the EPI hinge on representations of differential entropy using either Fisher information or minimum mean-square 
error (MMSE), which are derived from de Bruijn's identity. In this paper, we first present a unified view of these
proofs, showing that they share two essential ingredients: 1) a data processing argument applied to a covariance-preserving
linear transformation; 2) an integration over a path of a continuous Gaussian perturbation. Using these
ingredients, we develop a new and brief proof of the EPI through a mutual information inequality, which replaces 
Stam and Blachman's Fisher information inequality (FII) and an inequality for MMSE by Guo, Shamai and Verdu 
used in earlier proofs. The result has the advantage of being very simple in that it relies only on the basic properties 
of mutual information. These ideas are then generalized to various extended versions of the EPI: Zamir and Feder's 
generalized EPI for linear transformations of the random variables, Takano and Johnson's EPI for dependent variables, 
Liu and Viswanath's covariance-constrained EPI, and Costa's concavity inequality for the entropy power. 

Index Terms 

Entropy power inequality (EPI), differential entropy, mutual information, data processing inequality, Fisher
information inequality (FII), Fisher information, de Bruijn's identity, minimum mean-square error (MMSE), relative
entropy, divergence. 

I. Introduction 

In his historic 1948 paper, Shannon proposed the entropy power inequality (EPI) [1, Thm. 15], which asserts
that the entropy power of the sum of independent random vectors is at least the sum of their entropy powers;
equality holds iff 1 the random vectors are Gaussian with proportional covariances. The EPI is one of the deepest
inequalities in information theory, and has a long history. Shannon gave a variational argument [1, App. 6] to show 
that the entropy of the sum of two independent random vectors of given entropies has a stationary point where the 
two random vectors are Gaussian with proportional covariance matrices, but this does not exclude the possibility 
that the stationary point is not a global minimum. Stam [2] credits de Bruijn with a first rigorous proof of the 
EPI in the case where at most one of the random vectors is not Gaussian, using a relationship between differential 
entropy and Fisher information now known as de Bruijn's identity. A general proof of the EPI is given by Stam
[2] (see also Blachman [3]), based on a related Fisher information inequality (FII). Stam's proof is simplified in 
[4] and [5]. Meanwhile, Lieb [6] proved the EPI via a strengthened Young's inequality from functional analysis. 
While Lieb's proof does not use information theoretic arguments, Dembo, Cover and Thomas [4] showed that it 
can be recast in a unified proof of the EPI and the Brunn-Minkowski inequality in geometry (see also [7], [8]), 
which was included in the textbook by Cover and Thomas [9, § 17.8]. Recently, Guo, Shamai and Verdu [10] found 
an integral representation of differential entropy using minimum mean-square error (MMSE), which yields another 
proof of the EPI [11], [12]. A similar, continuous-time proof via causal MMSE was also proposed by Binia [13]. 
The original information theoretic proofs (by Stam and Blachman, and by Verdu, Guo and Shamai) were first given 
for scalar random variables, and then generalized to the vector case either by induction on the dimension [2], [3] 
or by extending the required tools [4], [11]. 

The EPI is used to bound capacity or rate-distortion regions for certain types of channel or source coding 
schemes, especially to prove converses of coding theorems in the case where optimality cannot be resolved by 
Fano's inequality alone. Shannon used the EPI as early as his 1948 paper [1] to bound the capacity of non-Gaussian 
additive noise channels. Other examples include Bergmans' solution [14] to the scalar Gaussian broadcast channel 
problem, generalized to the multiple-input multiple-output (MIMO) case in [15], [16]; Leung-Yan-Cheong and
Hellman's determination of the secrecy capacity of the Gaussian wire-tap channel [17], extended to the multiple 
access case in [18], [19]; Costa's solution to the scalar Gaussian interference channel problem [20]; Ozarow's 
solution to the scalar Gaussian source two-description problem [21], extended to multiple descriptions at high 
resolution in [22]; and Oohama's determination of the rate-distortion regions for various multiterminal Gaussian 
source coding schemes [23]-[26]. It is interesting to note that in all the above applications, the EPI is used only in 
the case where all but one of the random vectors in the sum are Gaussian. The EPI for general independent random
variables, as well as the corresponding FII, also finds application in blind source separation and deconvolution in the
context of independent component analysis (see, e.g., [27]-[29]), and is instrumental in proving a strong version
of the central limit theorem with convergence in relative entropy [5], [30]-[35].

It appears that the EPI is perhaps the only useful information theoretic inequality that is not proved through 
basic properties of entropy or mutual information. In this paper, we fill the gap by providing a new proof, with the 
following nice features: 

1 If and only if.

• it hinges solely on the elementary properties of Shannon's mutual information, sidestepping both Fisher's
information and MMSE. Thus, it relies only on the most basic principles of information theory; 

• it does not require scalar or vector identities such as de Bruijn's identity, nor integral representations of 
differential entropy; 

• the vector case is handled just as easily as the scalar case, along the same lines of reasoning; and 

• it goes with a mutual information inequality (MII), which is of interest in its own right.

Before turning to this proof, we make a detailed analysis of the existing information theoretic proofs 2 of the EPI. 
The reasons for this presentation are as follows: 

• it gives some idea of the level of difficulty that is required to understand conventional proofs. The new proof 
presented in this paper is comparatively simpler and shorter; 

• it focuses on the essential ingredients common to all information theoretic proofs of the EPI, namely data 
processing inequalities and integration over a path of continuous Gaussian perturbation. This serves as an
insightful guide to understand the new proof which uses the same ingredients, though in a more expedient
fashion;

• it simplifies some of the conventional argumentation and provides intuitive interpretations for the Fisher 
information and de Bruijn's identity, which have their own interests and applications. In particular, a new, 
simple proof of a (generalized) de Bruijn's identity, based on a well-known estimation theoretic relationship 
between relative entropy and Fisher information, is provided; 

• it offers a unified view of the apparently unrelated existing proofs of the EPI. Not only do they share essentials,
they can also be seen as variants of the same proof; and

• it derives the theoretical tools that are necessary to further discuss the relationship between the various 
approaches, especially for extended versions of the EPI. 

The EPI has been generalized in various ways. Costa [36] (see also [37]) strengthened the EPI for two random 
vectors in the case where one of these vectors is Gaussian, by showing that the entropy power is a concave function 
of the power of the added Gaussian noise. Zamir and Feder [38]-[40] generalized the scalar EPI by considering the 
entropy power of an arbitrary linear transformation of the random variables. Takano [41] and Johnson [42] provided 
conditions under which the original EPI still holds for two dependent variables. Recently, Liu and Viswanath [43], 
[44] generalized the EPI by considering a covariance-constrained optimization problem motivated by multiterminal
coding problems. The ideas in the new proof of the EPI presented in this paper are readily extended to all these 
situations. Again, in contrast to existing proofs, the obtained proofs rely only on the basic properties of entropy 
and mutual information. In some cases, further generalizations of the EPI are provided. 

The remainder of this paper is organized as follows. We begin with some notations and preliminaries. Section II 
surveys earlier information theoretic proofs of the EPI and presents a unified view of the proofs. Section III gives 
the new proof of the EPI, along with some discussions and perspectives. The reader may wish to skip directly 

2 Lieb's proof excepted, since it belongs to mathematical analysis and cannot be qualified as an "information theoretic" proof.

to the proof in this section, which does not use the tools presented earlier. Section IV extends the new proof to
Zamir and Feder's generalized EPI for arbitrary linear transformations of independent variables. Section V adapts 
the new proof to the case of dependent random vectors, generalizing the results of Takano and Johnson. Section VI 
generalizes the new proof to an explicit formulation of Liu and Viswanath's EPI under a covariance constraint, 
based on the corresponding MII. Section VII gives a proof of the concavity of the entropy power (Costa's EPI)
based on the MII, which relies only on the properties of mutual information. Section VIII concludes this paper
with some open questions about a recent generalization of the EPI to arbitrary subsets of independent variables 
[45]-[48] and a collection of convexity inequalities for linear "gas mixtures". 

A. Notations 

In this paper, to avoid log e factors in the derivations, information quantities are measured in nats — we shall use 
only natural logarithms and exponentials. Random variables or vectors are denoted by upper case letters, and their 
values denoted by lower case letters. The expectation $E(\cdot)$ is taken over the joint distribution of the random variables
within the parentheses. The covariance matrix of a random (column) $n$-vector $X$ is
$\operatorname{Cov}(X) = E\big((X - E(X))(X - E(X))^t\big)$, and its variance is the trace of the covariance matrix:
$\operatorname{Var}(X) = \operatorname{tr}(\operatorname{Cov}(X)) = E(\|X - E(X)\|^2)$. We
also use the notation $\sigma_X^2 = \frac{1}{n}\operatorname{Var}(X)$ for the variance per component. We say that $X$ is white if its covariance
matrix is proportional to the identity matrix, and standard if it has unit covariance matrix $\operatorname{Cov}(X) = I$.

With the exception of the conditional mean $E(X|Y)$, which is a function of $Y$, all quantities in the form $f(X|Y)$
used in this paper imply expectation over $Y$, following the usual convention for conditional information quantities.
Thus the conditional covariance matrix is $\operatorname{Cov}(X|Y) = E\big((X - E(X|Y))(X - E(X|Y))^t\big)$, and the conditional
variance is $\operatorname{Var}(X|Y) = \operatorname{tr}(\operatorname{Cov}(X|Y)) = E(\|X - E(X|Y)\|^2)$, that is, the MMSE in estimating $X$ given the
observation $Y$, achieved by the conditional mean estimator $\hat{X}(Y) = E(X|Y)$.

The diagonal matrix with given diagonal entries is denoted by $\operatorname{diag}(\cdot)$. We shall use the partial ordering between real
symmetric matrices where $A \le B$ means that the difference $B - A$ is positive semidefinite, that is, for any real vector $x$,
$x^t A x \le x^t B x$. Clearly $A \le B$ implies $CAC \le CBC$ for any symmetric matrix $C$, and $B^{-1} \le A^{-1}$ if $A$ and
$B$ are invertible.

Given a function $f(x)$, $\nabla f = \frac{\partial f}{\partial x}$ denotes the gradient, a (column) vector of partial derivatives
$\big(\frac{\partial f}{\partial x_i}\big)_i$, and $\nabla^2 f$ denotes the Hessian, a matrix of second partial derivatives
$\big(\frac{\partial^2 f}{\partial x_i \partial x_j}\big)_{i,j}$. We shall use Landau's notations $o(f)$ (a function
which is negligible compared to $f$ in the neighborhood of some limit value of $x$) and $O(f)$ (a function which is
dominated by $f$ in that neighborhood).

B. Entropy-Power Inequalities (EPI) 

The entropy power N(X) of a random n-vector X with density p(x) is [1] 

$$N(X) = \frac{1}{2\pi e} \exp\Big(\frac{2}{n} H(X)\Big) \tag{1}$$

where $H(X) = E \log \frac{1}{p(X)}$ is the (differential) entropy of $X$. In the following, we assume that entropies are well
defined, possibly with the convention that $H(X) = -\infty$ or $N(X) = 0$ if $X$ has a probability mass assigned to
one or more singletons in $\mathbb{R}^n$. The scaling properties

$$\begin{aligned} H(aX) &= H(X) + n \log |a| \\ N(aX) &= a^2 N(X), \end{aligned} \tag{2}$$

where $a \in \mathbb{R}$, follow from the definitions by a change of variable argument.

The non-Gaussianness of X is the relative entropy (divergence) with respect to a Gaussian random vector X* 
with identical second moments: 

$$D(X \| X^*) = H(X^*) - H(X) \tag{3}$$

where $H(X^*) = \frac{1}{2} \log\big((2\pi e)^n |\operatorname{Cov}(X)|\big)$. Since (3) is nonnegative and vanishes iff $X$ is Gaussian, the entropy
power (1) satisfies the inequalities

$$N(X) \le |\operatorname{Cov}(X)|^{1/n} \le \sigma_X^2 \tag{4}$$

with equality in the first inequality iff X is Gaussian, and in the second iff X is white. In particular, N(X) is the 
power of a white Gaussian random vector having the same entropy as X. 
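
As a quick numerical illustration of definitions (1)-(4) in the scalar case, the following Python sketch computes the entropy power of a uniform random variable and compares it with its variance. The helper names (`entropy_power`, `gaussian_entropy`) and the interval width are illustrative assumptions, not notation from the paper.

```python
import math

def entropy_power(h, n=1):
    """Entropy power (1): N = exp(2h/n) / (2*pi*e), with h in nats."""
    return math.exp(2.0 * h / n) / (2.0 * math.pi * math.e)

def gaussian_entropy(variance):
    """Differential entropy (nats) of a scalar Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

# Hypothetical example: X uniform on an interval of length `width` (n = 1).
width = 3.0
h_uniform = math.log(width)        # entropy of the uniform density
var_uniform = width ** 2 / 12.0    # its variance

n_uniform = entropy_power(h_uniform)
n_gauss = entropy_power(gaussian_entropy(var_uniform))

# Non-Gaussianness (3): H(X*) - H(X) for X* Gaussian with the same variance.
non_gaussianness = gaussian_entropy(var_uniform) - h_uniform

# Scaling property (2): N(aX) = a^2 N(X), using H(aX) = H(X) + log|a|.
a = -2.5
n_scaled = entropy_power(h_uniform + math.log(abs(a)))

print(n_uniform, var_uniform, n_gauss, non_gaussianness, n_scaled)
```

For the Gaussian with the same variance the entropy power equals the variance, while for the uniform variable it is strictly smaller, as (4) requires.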

From these observations, it is easily found that Shannon's EPI can be given several equivalent forms: 
Proposition 1 (Equivalent EPIs): The following inequalities, each stated for any finitely many independent random
vectors $(X_i)_i$ with densities and real-valued coefficients $(a_i)_i$, are equivalent.

$$N\Big(\sum_i a_i X_i\Big) \ge \sum_i a_i^2 N(X_i), \tag{5a}$$

$$H\Big(\sum_i a_i X_i\Big) \ge H\Big(\sum_i a_i \tilde{X}_i\Big), \tag{5b}$$

$$H\Big(\sum_i a_i X_i\Big) \ge \sum_i a_i^2 H(X_i) \qquad \Big(\sum_i a_i^2 = 1\Big), \tag{5c}$$

where the $(\tilde{X}_i)_i$ are independent Gaussian random vectors with proportional covariances (e.g., white) and
corresponding entropies $H(\tilde{X}_i) = H(X_i)$.

We have presented weighted forms of the inequalities to stress the similarity between (5a)-(5c). Note that by (2),
the normalization $\sum_i a_i^2 = 1$ is unnecessary for (5a) and (5b). The proof is given in [4] and is also partly included
in [11] in the scalar case. For completeness we include a short proof 3 .

Proof: That (5a), (5b) are equivalent follows from the equalities
$\sum_i a_i^2 N(X_i) = \sum_i a_i^2 N(\tilde{X}_i) = N\big(\sum_i a_i \tilde{X}_i\big)$.
To prove that (5c) is equivalent to (5a) we may assume that $\sum_i a_i^2 = 1$. Taking logarithms of both sides of (5a),
inequality (5c) follows from the concavity of the logarithm. Conversely, taking exponentials of both sides of (5c),
inequality (5a) follows provided that the $X_i$ have equal entropies. But the latter condition is unnecessary because
if (5a) is satisfied for the random vectors $(N(X_i)^{-1/2} X_i)_i$ of equal entropies, then upon modification of the
coefficients it is also satisfied for the $X_i$. □

3 This proof corrects a small error in [4], namely, that the first statement in the proof of Theorem 7 in [4] is false when the Gaussian random
vectors do not have identical covariances.

Inequality (5a) is equivalent to the classical formulation of the EPI [1] by virtue of the scaling property (2). 
Inequality (5b) is implicit in [1, App. 6], where Shannon's line of thought is to show that the entropy of the sum 
of independent random vectors of given entropies has a minimum where the random vectors are Gaussian with 
proportional covariance matrices. It was made explicit by Costa and Cover [7]. Inequality (5c) is due to Lieb [6] 
and is especially interesting since all available proofs of the EPI are in fact proofs of this inequality. It can be 
interpreted as a concavity property of entropy [4] under the covariance-preserving transformation 

$$(X_i)_i \mapsto Y = \sum_i a_i X_i \qquad \Big(\sum_i a_i^2 = 1\Big). \tag{6}$$

Interestingly, (5c) is most relevant in several applications of the EPI. Although the preferred form for use in 
coding applications [14]-[26] is the inequality $N(X + Z) \ge N(X) + N(Z)$, where $Z$ is Gaussian independent
of $X$, Liu and Viswanath [43], [44] suggest that the EPI's main contribution to multiterminal coding problems is for
solving optimization problems of the form $\max_X H(X) - \mu H(X + Z)$, whose solution is easily determined from
the convexity inequality (5c) as shown in Section VI. Also, (5c) is especially important for solving blind source
separation and deconvolution problems, because it implies that negentropy $C = -H$ satisfies the requirements for
a "contrast function":

$$C\Big(\sum_i a_i X_i\Big) \le \max_i C(X_i) \qquad \Big(\sum_i a_i^2 = 1\Big), \tag{7}$$

which serves as an objective function to be maximized in such problems [27]-[29]. Finally, the importance of the 
EPI for proving strong versions of the central limit theorem is through (5c) interpreted as a monotonicity property 
of entropy for standardized sums of independent variables [30], [34]. 
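
To make the equivalent forms (5a) and (5c) concrete, the sketch below checks them numerically for two independent variables uniform on [0, 1], whose sum has a triangular density. This is only an illustration under stated assumptions: the midpoint-rule integrator, the function names and the choice of distributions are ours, not part of the paper.

```python
import math

def entropy_power(h, n=1):
    """Entropy power (1) for a differential entropy h given in nats."""
    return math.exp(2.0 * h / n) / (2.0 * math.pi * math.e)

def differential_entropy(pdf, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -integral of p*log(p) over [lo, hi], in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        p = pdf(lo + (k + 0.5) * dx)
        if p > 0.0:
            total -= p * math.log(p) * dx
    return total

def uniform_pdf(x):
    """Density of X1 (and X2): uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def triangular_pdf(x):
    """Density of X1 + X2 for independent uniforms on [0, 1]."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0

h_x = differential_entropy(uniform_pdf, 0.0, 1.0)       # same for X1 and X2
h_sum = differential_entropy(triangular_pdf, 0.0, 2.0)  # entropy of X1 + X2

# Form (5a) with a1 = a2 = 1: N(X1 + X2) >= N(X1) + N(X2).
lhs_5a = entropy_power(h_sum)
rhs_5a = 2.0 * entropy_power(h_x)

# Form (5c) with a1 = a2 = 1/sqrt(2); H(aX) = H(X) + log|a| by (2).
a = 1.0 / math.sqrt(2.0)
lhs_5c = h_sum + math.log(a)
rhs_5c = a * a * h_x + a * a * h_x

print(lhs_5a, rhs_5a, lhs_5c, rhs_5c)
```

Both inequalities hold strictly here, consistent with the fact that equality requires Gaussian vectors.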

II. Earlier Proofs Revisited 

A. Fisher Information Inequalities (FII) 

Conventional information theoretic proofs of the EPI use an alternative quantity, the Fisher information (or a 
disguised version of it), for which the statements corresponding to (5) are easier to prove. The Fisher information 
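matrix $\mathbf{J}(X)$ of a random $n$-vector $X$ with density $p(x)$ is [4], [9]

$$\mathbf{J}(X) = \operatorname{Cov}(S(X)) \tag{8}$$

where the zero-mean random variable (log-derivative of the density)

$$S(X) = \nabla \log p(X) = \frac{\nabla p(X)}{p(X)} \tag{9}$$

is known as the score. The Fisher information $J(X)$ is the trace of (8):

$$J(X) = \operatorname{Var}(S(X)) = E\Big(\Big\|\frac{\nabla p(X)}{p(X)}\Big\|^2\Big) \tag{10}$$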
In this and the following subsections, we assume that probability densities are sufficiently smooth with sufficient
decay at infinity so that Fisher informations exist, possibly with the convention that $J(X) = \infty$ if $X$ has a
probability mass assigned to one or more singletons in $\mathbb{R}^n$. The scaling properties

$$\begin{aligned} S(aX) &= a^{-1} S(X) \\ J(aX) &= a^{-2} J(X) \end{aligned} \tag{11}$$

follow from the definitions by a change of variable argument. Note that if $X$ has independent entries, then $\mathbf{J}(X)$
is the diagonal matrix $\mathbf{J}(X) = \operatorname{diag}(J(X_i))_i$.

It is easily seen that the score S(X) is a linear function of X iff X is Gaussian. Therefore, a measure of 
non-Gaussianness of X is the mean-square error of the score with respect to the (linear) score S* of a Gaussian 
random vector X* with identical second moments: 

$$E(\|S(X) - S^*(X)\|^2) = J(X) - J(X^*) \tag{12}$$

where $J(X^*) = \operatorname{tr}(\operatorname{Cov}(X)^{-1})$. Since (12) is nonnegative and vanishes iff $X$ is Gaussian, the Fisher
information (10) satisfies the inequalities

$$J(X) \ge \operatorname{tr}(\operatorname{Cov}(X)^{-1}) \ge \frac{n}{\sigma_X^2}. \tag{13}$$

The first inequality (an instance of the Cramér-Rao inequality) holds with equality iff $X$ is Gaussian, while the
second inequality (a particular case of the Cauchy-Schwarz inequality on the eigenvalues of $\operatorname{Cov}(X)$) holds with
equality iff $X$ is white. In particular, $n J^{-1}(X)$ is the power of a white Gaussian random vector having the same
Fisher information as $X$.
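
The inequalities (13) can be checked numerically on a simple non-Gaussian example. The sketch below uses a scalar standard logistic variable, whose score is $-\tanh(x/2)$, and evaluates the Fisher information (10) and the variance by midpoint integration. The choice of distribution, the integration range and the helper names are assumptions made for illustration only.

```python
import math

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx

def logistic_pdf(x):
    """Standard logistic density, written as 1 / (4 cosh^2(x/2))."""
    return 0.25 / math.cosh(0.5 * x) ** 2

def logistic_score(x):
    """Score S(x) = d/dx log p(x) = -tanh(x/2) for the logistic density."""
    return -math.tanh(0.5 * x)

LO, HI = -40.0, 40.0  # the tails beyond this range are negligible

# Fisher information (10): J(X) = E(S(X)^2) in the scalar case (n = 1).
fisher = integrate(lambda x: logistic_score(x) ** 2 * logistic_pdf(x), LO, HI)

# Variance (the mean is zero by symmetry).
variance = integrate(lambda x: x * x * logistic_pdf(x), LO, HI)

# Inequality (13) with n = 1 reads J(X) >= 1 / Var(X).
print(fisher, 1.0 / variance, fisher * variance)
```

The product of Fisher information and variance exceeds 1, with the gap reflecting the non-Gaussianness measured by (12); for a Gaussian variable the product would equal 1.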

Proposition 2 (Equivalent FIIs): The following inequalities, each stated for any finitely many independent random
vectors $(X_i)_i$ with differentiable densities and real-valued coefficients $(a_i)_i$, are equivalent.

$$J^{-1}\Big(\sum_i a_i X_i\Big) \ge \sum_i a_i^2 J^{-1}(X_i), \tag{14a}$$

$$J\Big(\sum_i a_i X_i\Big) \le J\Big(\sum_i a_i \tilde{X}_i\Big), \tag{14b}$$

$$J\Big(\sum_i a_i X_i\Big) \le \sum_i a_i^2 J(X_i) \qquad \Big(\sum_i a_i^2 = 1\Big), \tag{14c}$$

where the $(\tilde{X}_i)_i$ are independent Gaussian random vectors with proportional covariances (e.g., white) and
corresponding Fisher informations $J(\tilde{X}_i) = J(X_i)$.

There is a striking similarity with Proposition 1. The proof is the same, with the appropriate changes (the
convexity of the hyperbolic function $1/x$ is used in place of the concavity of the logarithm), and is omitted. Inequality (14c)
is due to Stam and its equivalence with (14a) was pointed out to him by de Bruijn [2]. It can be shown [49], [50]
that the above inequalities also hold for positive semidefinite symmetric matrices, where Fisher informations (10) 
are replaced by Fisher information matrices (8). 

As for (5c), inequality (14c) can be interpreted as a convexity property of Fisher information [4] under
the covariance-preserving transformation (6), or as a monotonicity property for standardized sums of independent
variables [30], [35]. It implies that the Fisher information $C = J$ satisfies (7), and therefore, can be used as a
contrast function in deconvolution problems [27]. The FII has also been used to prove a strong version of the 
central limit theorem [5], [30]-[35] and a characterization of the Gaussian distribution by rotation [49], [51]. 

B. Data Processing Inequalities for Least Squares Estimation 4 

Before turning to the proof of the FII, it is convenient and useful to establish some preliminaries about data processing
inequalities for Fisher information and MMSE. In estimation theory, the importance of the Fisher information
follows from the Cramér-Rao bound (CRB) [9] on the mean-squared error of an estimator of a parameter $\theta \in \mathbb{R}^m$
from a measurement $X \in \mathbb{R}^n$. In this context, $X$ is a random $n$-vector whose density $p_\theta(x)$ depends on $\theta$, and the
(parametric) Fisher information matrix is defined by [9], [50]

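$$\mathbf{J}_\theta(X) = \operatorname{Cov}(S_\theta(X)) \tag{15}$$

where $S_\theta(X)$ is the (parametric) score function,

$$S_\theta(X) = \frac{\partial}{\partial\theta} \log p_\theta(X). \tag{16}$$

In some references the parametric Fisher information is defined as the trace of (15):

$$J_\theta(X) = \operatorname{Var}(S_\theta(X)). \tag{17}$$

In the special case where $\theta \in \mathbb{R}^n$ is a translation parameter, $p_\theta(x) = p(x + \theta)$, we recover the earlier definitions (8)-(10):
$S(X) = S_\theta(X - \theta)$, $\mathbf{J}(X) = \mathbf{J}_\theta(X - \theta)$, and $J(X) = J_\theta(X - \theta)$. More generally, it is easily checked that
for any $a \in \mathbb{R}$,

$$S_\theta(X - a\theta) = a S(X) \tag{18a}$$

$$J_\theta(X - a\theta) = a^2 J(X). \tag{18b}$$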

The optimal unbiased estimator of $\theta$ given the observation $X$, if it exists, is such that the mean-square error meets
the CRB (reciprocal of the Fisher information) [9]. Such an optimal estimator is easily seen to be a linear function
of the score (16). Thus it may be said that the score function $S_\theta(X)$ represents the optimal least squares estimator
of $\theta$. When the estimated quantity $\theta$ is a random variable (i.e., not a parameter), the optimal estimator is the
conditional mean estimator $E(\theta|X)$ and the corresponding minimum mean-square error (MMSE) is the conditional
variance $\operatorname{Var}(\theta|X)$.

In both cases, there is a data processing theorem [50] relative to a transformation $X \to Y$ in a Markov chain
$\theta \to X \to Y$, that is, for which $Y$ given $X$ is independent of $\theta$. To emphasize the similarity between these
data processing theorems and the corresponding quantities of Fisher information and MMSE, we first prove the
following "chain rule", which states that the optimal estimation given $Y$ of $\theta$ results from the optimal estimation
given $Y$ of the optimal estimation given $X$ of $\theta$:

4 We use the term "least squares estimation" for any estimation procedure based on the mean-squared error criterion. 
Proposition 3 (Data Processing Theorem for Estimators): If $\theta \to X \to Y$ form a Markov chain, then

$$E(\theta|Y) = E\big(E(\theta|X) \,\big|\, Y\big) \tag{19a}$$

$$S_\theta(Y) = E_\theta\big(S_\theta(X) \,\big|\, Y\big). \tag{19b}$$

Proof: In the nonparametric case the Markov chain condition can be written as $p(\theta|x,y) = p(\theta|x)$. Multiplying
by $\theta\, p(x|y)$ gives $\theta\, p(\theta,x|y) = \theta\, p(\theta|x)\, p(x|y)$, which integrating over $\theta$ and $x$ yields (19a). In the parametric case
the Markov chain condition can be written as $p_\theta(x,y) = p_\theta(x)\, p(y|x)$ where the distribution $p(y|x)$ is independent
of $\theta$. Differentiating with respect to $\theta$ gives $\frac{\partial p_\theta}{\partial\theta}(x,y) = \frac{\partial p_\theta}{\partial\theta}(x)\, p(y|x)$; dividing by $p_\theta(y)$ and applying Bayes' rule
yields the relation $\frac{\partial p_\theta}{\partial\theta}(x,y) \big/ p_\theta(y) = \Big(\frac{\partial p_\theta}{\partial\theta}(x) \big/ p_\theta(x)\Big)\, p_\theta(x|y)$, which integrating over $x$ yields (19b). □

From Proposition 3 we obtain a unified proof of the corresponding data processing inequalities for least squares
estimation, which assert that the transformation $X \to Y$ reduces information about $\theta$, or in other words, that no
clever transformation can improve the inferences made on the data measurements: compared to $X$, the observation
$Y$ yields a worse estimation of $\theta$.

Proposition 4 (Estimation Theoretic Data Processing Inequalities): If $\theta \to X \to Y$ form a Markov chain, then

$$\operatorname{Cov}(\theta|Y) \ge \operatorname{Cov}(\theta|X) \tag{20a}$$

$$\mathbf{J}_\theta(Y) \le \mathbf{J}_\theta(X). \tag{20b}$$

In particular,

$$\operatorname{Var}(\theta|Y) \ge \operatorname{Var}(\theta|X) \tag{21a}$$

$$J_\theta(Y) \le J_\theta(X). \tag{21b}$$

Equality holds iff

$$E(\theta|X) = E(\theta|Y) \quad \text{a.e.}, \tag{22a}$$

$$S_\theta(X) = S_\theta(Y) \quad \text{a.e.}, \tag{22b}$$

respectively.

Proof: The following identity ("law of total covariance") is well known and easy to check:

$$\operatorname{Cov}(U) = \operatorname{Cov}(U|V) + \operatorname{Cov}(E(U|V)). \tag{23}$$

For $U = E(\theta|X)$ or $U = S_\theta(X)$, and $V = Y$, we obtain, by Proposition 3,

$$\operatorname{Cov}(\theta|X) = \operatorname{Cov}(\theta|Y) - \operatorname{Cov}\big(E(\theta|X) \,\big|\, Y\big) \tag{24a}$$

$$\mathbf{J}_\theta(X) = \mathbf{J}_\theta(Y) + \operatorname{Cov}\big(S_\theta(X) \,\big|\, Y\big), \tag{24b}$$

where in deriving (24a) we have also used (23) for $U = \theta$. Since covariance matrices are positive semidefinite, this
proves (20a), (20b), and (21a), (21b) follow by taking the trace. Equality holds in (20a), (21a) or in (20b), (21b)
iff $E(\theta|X)$ or $S_\theta(X)$ is a deterministic function of $Y$, which by Proposition 3 is equivalent to (22a) or (22b),
respectively. □
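
The chain rule (19a), the MMSE data processing inequality (21a) and the trace of the loss identity (24a) can be verified exactly on a small discrete example. In the sketch below, $\theta$ is uniform on {0, 1, 2}, $X = \theta$ plus an independent noise term, and $Y$ is a deterministic function of $X$, so that $\theta \to X \to Y$ is a Markov chain. The distributions, the processing map and all function names are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical toy model: theta uniform on {0, 1, 2}, X = theta + noise.
p_theta = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
p_noise = {-1: 0.25, 0: 0.5, 1: 0.25}

def process(x):
    """Deterministic processing X -> Y, so theta -> X -> Y is a Markov chain."""
    return 1 if x >= 1 else 0

# Joint probability mass function over triples (theta, x, y).
joint = {}
for (theta, p_t), (noise, p_w) in product(p_theta.items(), p_noise.items()):
    x = theta + noise
    key = (theta, x, process(x))
    joint[key] = joint.get(key, 0.0) + p_t * p_w

def theta_of(key):
    return key[0]

def x_of(key):
    return key[1]

def y_of(key):
    return key[2]

def cond_mean(value, given):
    """Map each outcome g of `given` to E(value | given = g)."""
    num, den = {}, {}
    for key, p in joint.items():
        g = given(key)
        num[g] = num.get(g, 0.0) + p * value(key)
        den[g] = den.get(g, 0.0) + p
    return {g: num[g] / den[g] for g in den}

def mmse(value, given):
    """Conditional variance Var(value | given) = E((value - E(value|given))^2)."""
    mean = cond_mean(value, given)
    return sum(p * (value(key) - mean[given(key)]) ** 2 for key, p in joint.items())

mean_given_x = cond_mean(theta_of, x_of)

def estimate_from_x(key):
    """The conditional mean estimator E(theta | X), viewed as a random variable."""
    return mean_given_x[x_of(key)]

var_given_x = mmse(theta_of, x_of)       # Var(theta | X)
var_given_y = mmse(theta_of, y_of)       # Var(theta | Y)
loss = mmse(estimate_from_x, y_of)       # Var(E(theta|X) | Y), trace of (24a)

mean_given_y = cond_mean(theta_of, y_of)        # E(theta | Y)
chained = cond_mean(estimate_from_x, y_of)      # E(E(theta|X) | Y), cf. (19a)

print(var_given_x, var_given_y, loss)
```

The difference between the two conditional variances coincides with the "loss" term, which is the scalar version of (24a).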
Stam [2] mentioned that (21b) is included in the original work of Fisher, in the case where $Y$ is a deterministic
function of $X$. A different proof of (20b) is provided by Zamir [50]. The above proof also gives, via (24a), (24b) or
the corresponding identities for the variance, explicit expressions for the information "loss" due to processing. The
equality conditions correspond to the case where the optimal estimators given $X$ or $Y$ are the same. In particular,
it is easily checked that (22b) is equivalent to the fact that $\theta \to Y \to X$ (in this order) also form a Markov chain,
that is, $Y$ is a "sufficient statistic" relative to $X$ [9].

As a consequence of Proposition 4 we obtain a simple proof of the following relation between Fisher information 
and MMSE in the case where estimation is made in Gaussian noise: 

Proposition 5 (Complementary Relation between Fisher Information and MMSE): If $Z$ is Gaussian independent
of $X$, then

$$\mathbf{J}(X + Z)\operatorname{Cov}(Z) + \operatorname{Cov}(Z)^{-1}\operatorname{Cov}(X|X + Z) = I. \tag{25}$$

In particular, if $Z$ is white Gaussian,

$$\sigma_Z^2\, J(X + Z) + \sigma_Z^{-2} \operatorname{Var}(X|X + Z) = n. \tag{26}$$

Proof: Apply (24b) to the Markov chain $\theta \to (X, Z - \theta) \to X + Z - \theta$, where $X$ and $Z$ are independent
of $\theta$ and of each other. Since $S_\theta(X, Z - \theta) = S_\theta(X) + S_\theta(Z - \theta) = S(Z) = -\operatorname{Cov}(Z)^{-1}(Z - E(Z))$, we have
$\mathbf{J}_\theta(X, Z - \theta) = \mathbf{J}(Z) = \operatorname{Cov}(Z)^{-1}$. Therefore, (24b) reads

$$\operatorname{Cov}(Z)^{-1} = \mathbf{J}(X + Z) + \operatorname{Cov}(Z)^{-1}\operatorname{Cov}(Z|X + Z)\operatorname{Cov}(Z)^{-1}.$$

Noting that $Z - E(Z|X + Z) = E(X|X + Z) - X$, one has $\operatorname{Cov}(Z|X + Z) = \operatorname{Cov}(X|X + Z)$ and (25) follows
upon multiplication by $\operatorname{Cov}(Z)$. For white Gaussian $Z$, (26) follows by taking the trace. □

As noted by Madiman and Barron [47], (26) is known in Bayesian estimation (average risk optimality): see
[52, Thm. 4.3.5] in the general situation where $X + Z$ is replaced by any variable $Y$ such that $p(y|x)$ belongs to an
exponential family parameterized by $x$. It was rediscovered independently by Budianu and Tong [53], and by Guo,
Shamai and Verdu [10], [54]. Relation (25) was also rederived by Palomar and Verdu [55] as a consequence of
a generalized de Bruijn's identity (Corollary 1 below). Other existing proofs are by direct calculation. The above 
proof is simpler and offers an intuitive alternative based on the data processing theorem. 

To illustrate (25), consider the case where $X$ and $Z$ are zero-mean Gaussian. In this case, the conditional
mean estimator $E(X|X + Z)$ is linear of the form $A(X + Z)$, where $A$ is given by the Wiener-Hopf equations
$A \operatorname{Cov}(X + Z) = E(X(X + Z)^t) = \operatorname{Cov}(X)$. Therefore $E(X|X + Z) = \operatorname{Cov}(X)\operatorname{Cov}(X + Z)^{-1}(X + Z) =
X + Z - \operatorname{Cov}(Z)\operatorname{Cov}(X + Z)^{-1}(X + Z)$. This gives, after some calculations, $\operatorname{Cov}(X|X + Z) = \operatorname{Cov}(Z) -
\operatorname{Cov}(Z)\operatorname{Cov}(X + Z)^{-1}\operatorname{Cov}(Z)$. But this expression is also an immediate consequence of (25) since one has simply
$\mathbf{J}(X + Z) = \operatorname{Cov}(X + Z)^{-1}$.
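
The Gaussian illustration above lends itself to a direct numerical check. The following sketch, for $n = 2$, computes the error covariance of the linear (Wiener) estimator, compares it with the closed form stated in the text, and verifies relation (25). The covariance values and the small 2x2 matrix helpers are illustrative assumptions; only the Python standard library is used.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def sub(a, b):
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]

def inv(a):
    """Inverse of an invertible 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(2) for j in range(2))

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical positive definite covariances of zero-mean Gaussian X and Z.
cov_x = [[2.0, 0.6], [0.6, 1.0]]
cov_z = [[0.5, 0.1], [0.1, 0.8]]

cov_sum = add(cov_x, cov_z)     # Cov(X + Z), by independence
fisher_sum = inv(cov_sum)       # J(X + Z) = Cov(X + Z)^{-1} for Gaussian X + Z

# Wiener-Hopf solution A = Cov(X) Cov(X + Z)^{-1} and its error covariance.
gain = matmul(cov_x, fisher_sum)
cov_x_given_sum = sub(cov_x, matmul(gain, cov_x))

# Closed form from the text: Cov(Z) - Cov(Z) Cov(X + Z)^{-1} Cov(Z).
closed_form = sub(cov_z, matmul(matmul(cov_z, fisher_sum), cov_z))

# Left-hand side of (25).
lhs_25 = add(matmul(fisher_sum, cov_z), matmul(inv(cov_z), cov_x_given_sum))

print(max_abs_diff(cov_x_given_sum, closed_form), max_abs_diff(lhs_25, IDENTITY))
```

Both printed discrepancies are at the level of floating-point rounding, in agreement with (25).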

For standard Gaussian $Z$, (26) reduces to the identity $J(X + Z) + \operatorname{Var}(X|X + Z) = n$, which constitutes a
simple complementary relation between Fisher information and MMSE. The lower the MMSE, that is, the higher the
Fisher information of $X + Z$, the better the estimation of $X$ from the noisy version $X + Z$. Thus Fisher
information can be interpreted as a measure of the efficiency of least squares (nonparametric) estimation, when estimation
is made in additive Gaussian noise.

C. Proofs of the FII via Data Processing Inequalities 

Three distinct proofs of the FII (14c) are available in the literature. In this section, we show that these are in fact
variations on the same theme: thanks to the presentation of Section II-B, each proof can be easily interpreted as an
application of the data processing theorem to the (linear) deterministic transformation $(X_i)_i \mapsto Y$ given by (6), or
in parametric form:

$$Y - \theta = \sum_i a_i (X_i - a_i \theta) \qquad \Big(\sum_i a_i^2 = 1\Big). \tag{27}$$

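1) Proof via the Data Processing Inequality for Fisher Information: This is essentially Stam's proof [2] (see
also Zamir [50] for a direct proof of (14a) by this method). Simply apply (21b) to the transformation (27):

$$J_\theta\Big(\sum_i a_i X_i - \theta\Big) \le J_\theta\big((X_i - a_i\theta)_i\big) = \sum_i J_\theta(X_i - a_i\theta). \tag{28}$$

By (18b), the left-hand side equals $J(\sum_i a_i X_i)$ and each term on the right-hand side equals $a_i^2 J(X_i)$, so the FII (14c) follows.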

2) Proof via Conditional Mean Representations of the Score: This proof is due to Blachman [3] in the scalar case
($n = 1$). His original derivation is rather technical, since it involves a direct calculation of the convolution of the
densities of independent random variables $U$ and $V$ to establish that $S(U + V) = E(\lambda S(U) + (1 - \lambda)S(V) \,|\, U + V)$
for any $0 \le \lambda \le 1$, followed by an application of the Cauchy-Schwarz inequality. The following derivation is
simpler and relies on the data processing theorem: By Proposition 3 applied to the transformation (27),

$$S_\theta\Big(\sum_i a_i X_i - \theta\Big) = E\Big(S_\theta\big((X_i - a_i\theta)_i\big) \,\Big|\, \sum_i a_i X_i - \theta\Big) = E\Big(\sum_i S_\theta(X_i - a_i\theta) \,\Big|\, \sum_i a_i X_i - \theta\Big)$$

which from (18a) gives the following conditional mean representation of the score:

$$S\Big(\sum_i a_i X_i\Big) = E\Big(\sum_i a_i S(X_i) \,\Big|\, \sum_i a_i X_i\Big). \tag{29}$$

This representation includes Blachman's as a special case (for two variables $U = a_1 X_1$ and $V = a_2 X_2$). The
rest of Blachman's argument parallels the above proof of the data processing inequality for Fisher information
(Proposition 4): His application of the Cauchy-Schwarz inequality [3] is simply a consequence of the law of total
variance $\operatorname{Var}(U) = \operatorname{Var}(U|V) + \operatorname{Var}(E(U|V))$. Indeed, taking $U = \sum_i a_i S(X_i)$, $V = \sum_i a_i X_i$, and using (29),
the inequality $\operatorname{Var}(U) \ge \operatorname{Var}(E(U|V))$ reduces to the FII (14c). Thus we see that, despite appearances, the above
two proofs of Stam and Blachman are completely equivalent.

3) Proof via the Data Processing Inequality for MMSE: This proof is due to Verdu and Guo [11], who use
MMSE in lieu of Fisher's information. Apply (21a) to the transformation (6), in which each $X_i$ is replaced by
$X_i + Z_i$, where the $(Z_i)_i$ are i.i.d. white Gaussian of variance $\sigma^2$. Writing $Z = \sum_i a_i Z_i$, this gives

$$\operatorname{Var}\Big(\sum_i a_i X_i \,\Big|\, \sum_i a_i X_i + Z\Big) \ge \operatorname{Var}\Big(\sum_i a_i X_i \,\Big|\, (X_i + Z_i)_i\Big) = \sum_i a_i^2 \operatorname{Var}(X_i \,|\, X_i + Z_i) \tag{30}$$

where $Z$ is also white Gaussian of variance $\sigma^2$. By the complementary relation (26) (Proposition 5), this inequality
is equivalent to the FII $J(\sum_i a_i X_i + Z) \le \sum_i a_i^2 J(X_i + Z_i)$ and letting $\sigma^2 \to 0$ gives (14c) 5 . Again this proof is
equivalent to the preceding ones, by virtue of the complementary relation between Fisher information and MMSE.

4) Conditions for Equality in the FII: The case of equality in (14c) was settled by Stam [2] and Blachman [3]. In
Stam's approach, by Proposition 4, equation (22b), equality holds in (28) iff $\sum_i S_\theta(X_i - a_i\theta) = S_\theta(\sum_i a_i X_i - \theta)$,
that is, using (18a),

$$\sum_i a_i S(X_i) = S\Big(\sum_i a_i X_i\Big) \quad \text{a.e.} \tag{31}$$

This equality condition is likewise readily obtained in Blachman's approach above. Obviously, it is satisfied only if
all scores for which $a_i \ne 0$ are linear functions, which means that equality holds in the FII only if the corresponding
random vectors are Gaussian. In addition, replacing the scores by their expressions for Gaussian random $n$-vectors
in (31), it follows easily by identification that these random vectors have identical covariance matrices. Thus equality
holds in (14c) iff all random vectors $X_i$ such that $a_i \ne 0$ are Gaussian with identical covariances.

Verdu and Guo do not derive the case of equality in [11]. From the preceding remarks, however, it follows that
equality holds in (30) only if the $(X_i + Z_i)_i$ for which $a_i \ne 0$ are Gaussian, and therefore the corresponding
$(X_i)_i$ are themselves Gaussian. This result is not evident from estimation-theoretic properties alone in view of the
equality condition (22a) in the data processing inequality for the MMSE.

D. De Bruijn's Identity

1) Background: De Bruijn's identity is the fundamental relation between differential entropy and Fisher information, and as such, is used to prove the EPI (5c) from the corresponding FII (14c). This identity can be stated in the form [4]

$$\frac{d}{dt} H(X + \sqrt{t}\, Z)\Big|_{t=0} = \frac{1}{2} J(X) \tag{32}$$

where $Z$ is standard Gaussian, independent of the random n-vector $X$. It is proved in the scalar case in [2], generalized to the vector case by Costa and Cover [7] and to nonstandard Gaussian $Z$ by Johnson and Suhov [33], [42]. The conventional, technical proof of de Bruijn's identity relies on a diffusion equation satisfied by the Gaussian distribution and is obtained by integrating by parts in the scalar case and invoking Green's identity in the vector case. We shall give a simpler and more intuitive proof of a generalized identity for arbitrary (not necessarily Gaussian) $Z$:

Proposition 6 (De Bruijn's Identity): For any two independent random n-vectors $X$ and $Z$ such that $\mathbf{J}(X)$ exists and $Z$ has finite covariances,

$$\frac{d}{dt} H(X + \sqrt{t}\, Z)\Big|_{t=0} = \frac{1}{2} \operatorname{tr}\bigl(\mathbf{J}(X)\, \mathrm{Cov}(Z)\bigr). \tag{33a}$$

In particular, if $Z$ is white or $X$ has i.i.d. entries,

$$\frac{d}{dt} H(X + \sqrt{t}\, Z)\Big|_{t=0} = \frac{1}{2} \sigma_Z^2\, J(X). \tag{33b}$$

5 This continuity argument is justified in [56].

2) A Simple Proof of De Bruijn's Identity: The proof is based on the following observation. Setting $\theta = \sqrt{t}$, (33a) can be rewritten as a first-order Taylor expansion in $\theta^2$:

$$H(X + \theta Z) - H(X) = \frac{\theta^2}{2}\, E\bigl((Z - E(Z))^t\, \mathbf{J}(X)\, (Z - E(Z))\bigr) + o(\theta^2). \tag{34}$$

Now, there is a well-known, similar expansion of relative entropy (divergence)

$$D_X(p_\theta \| p_{\theta'}) = E_\theta \log \frac{p_\theta(X)}{p_{\theta'}(X)} \tag{35}$$

in terms of parametric Fisher information (15), for a parameterized family of densities $p_\theta(x)$, $\theta \in \mathbb{R}^m$. Indeed, since the divergence is nonnegative and vanishes for $\theta' = \theta$, its second-order Taylor expansion takes the form [57]

$$D_X(p_\theta \| p_{\theta'}) = \frac{1}{2} (\theta' - \theta)^t\, \mathbf{J}_\theta(X)\, (\theta' - \theta) + o(\|\theta' - \theta\|^2), \tag{36}$$

where $\mathbf{J}_\theta(X)$ is the positive semidefinite Hessian matrix of the divergence, that is, $\mathbf{J}_\theta(X) = \frac{\partial^2}{\partial \theta'^2} D_X(p_\theta \| p_{\theta'})\big|_{\theta'=\theta} = E_\theta\bigl(-\frac{\partial^2}{\partial \theta^2} \log p_\theta(X)\bigr)$, which is easily seen to coincide with definition (15)$^6$. In view of the similarity between (34) and (36), the following proof of de Bruijn's identity is almost immediate.
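
As a quick numerical illustration of Kullback's expansion (36), the following sketch uses the scalar exponential family $p_\theta(x) = \theta e^{-\theta x}$, $x > 0$. This family is our own choice of example and is not part of the original analysis; it is convenient because both the divergence and the Fisher information $J_\theta = 1/\theta^2$ are available in closed form. The function names are hypothetical.

```python
import math

def kl_exponential(theta, theta_prime):
    """Closed-form D(p_theta || p_theta') for exponential densities with the given rates."""
    return math.log(theta / theta_prime) + theta_prime / theta - 1.0

def fisher_exponential(theta):
    """Parametric Fisher information of the exponential density about its rate theta."""
    return 1.0 / theta ** 2

theta = 2.0  # illustrative value, chosen arbitrarily
for delta in (1e-1, 1e-2, 1e-3):
    exact = kl_exponential(theta, theta + delta)
    quadratic = 0.5 * fisher_exponential(theta) * delta ** 2  # right-hand side of (36)
    print(f"delta={delta:g}  D={exact:.3e}  quadratic={quadratic:.3e}  ratio={exact / quadratic:.4f}")
```

The ratio should approach 1 as the parameter offset shrinks, which is the content of (36) in this one-parameter case.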

Proof of Proposition 6: Let $Y = X + \theta Z$ and write mutual information $I(X + \theta Z; Z) = H(X + \theta Z) - H(X)$ as a conditional divergence: $I(Y; Z) = D(p(y|z) \| p(y)) = E_Z\, D(p_X(y - \theta Z) \| p_Y(y))$. Making the change of variable $u = y - \theta z$ gives $I(X + \theta Z; Z) = E_Z(D(q_0 \| q_\theta))$, where $q_\theta(u) = p_{X+\theta Z}(u + \theta z)$ is the parameterized family of densities of a random variable $U$, and $q_0(u) = p_X(u)$. Therefore, by (36) for scalar $\theta$,

$$I(X + \theta Z; Z) = \frac{\theta^2}{2}\, E_Z\bigl(J_0(U)\bigr) + o(\theta^2), \tag{37}$$

where $J_0(U)$ is the parametric Fisher information of $U$ about $\theta = 0$, which is easily determined as follows.

Expanding $p(y|z) = p_X(y - \theta z)$ about $\theta = 0$ gives $p(y|z) = p_X(y) - \theta\, z^t \nabla p_X(y) + o(\theta)$, and therefore, $p(y) = E(p(y|Z)) = p_X(y) - \theta\, E(Z)^t \nabla p_X(y) + o(\theta)$, where the limit for $\theta \to 0$ and the expectation have been exchanged, due to Lebesgue's convergence theorem and the fact that $Z$ has finite covariances. It follows that $q_\theta(u) = p_Y(u + \theta z) = q_0(u) + \theta\, (z - E(Z))^t \nabla p_X(u) + o(\theta)$ so that the (parametric) score of $U$ for $\theta = 0$ is $S_0(U) = \frac{\partial}{\partial\theta} \log q_\theta(U)\big|_{\theta=0} = (z - E(Z))^t\, \frac{\nabla p_X(U)}{p_X(U)}$, where $\frac{\nabla p_X}{p_X}$ is the (nonparametric) score of $X$. Therefore, $J_0(U) = \mathrm{Var}(S_0(U)) = (z - E(Z))^t\, \mathbf{J}(X)\, (z - E(Z))$. Plugging this expression into (37) gives (34) as required.

□ 

In exploiting the parallelism between (34) and (36), this proof explains the presence of the 1/2 factor in de 
Bruijn's identity: this is merely a second-order Taylor expansion factor due to the definition of Fisher information 
as the second derivative of divergence. Besides, it is mentioned in [4] that (32) holds for any random vector Z 
whose first four moments coincide with those of the standard Gaussian; here we see that it is sufficient that this 
condition hold for the second centered moments (Cov(Z) = I). Also note that it is not required that Z have a 
density. Thus, (33) also holds for a discrete valued perturbation Z. 

6 Even though the divergence is not symmetric in $(\theta, \theta')$, it is locally symmetric in the sense that (36) is also the second-order Taylor expansion for $D_X(p_{\theta'} \| p_\theta)$.

3) The Gaussian Case: When $Z$ is Gaussian, de Bruijn's identity (33) is readily extended to positive values of $t$. Simply substitute $X + \sqrt{t'}\, Z'$ for $X$, where $Z'$ is independent of $Z$ with the same distribution. By the stability property of the Gaussian distribution under convolution, $X + \sqrt{t'}\, Z' + \sqrt{t}\, Z$ and $X + \sqrt{t + t'}\, Z$ are identically distributed, and, therefore,

$$\frac{d}{dt} H(X + \sqrt{t}\, Z) = \frac{1}{2} \operatorname{tr}\bigl(\mathbf{J}(X + \sqrt{t}\, Z)\, \mathrm{Cov}(Z)\bigr). \tag{38a}$$

For white $Z$, this reduces to

$$\frac{d}{dt} H(X + \sqrt{t}\, Z) = \frac{1}{2} \sigma_Z^2\, J(X + \sqrt{t}\, Z). \tag{38b}$$

Such a generalization cannot be established for non-Gaussian $Z$, because the Gaussian distribution is the only stable distribution with finite covariances. Using the complementary relation (25) of Proposition 5 and making the change of variable $t' = 1/t$, it is a simple matter of algebra to show that (38a) is equivalent to

$$\frac{d}{dt} H(\sqrt{t}\, X + Z) = \frac{1}{2} \operatorname{tr}\bigl(\mathrm{Cov}(Z)^{-1}\, \mathrm{Cov}(X \mid \sqrt{t}\, X + Z)\bigr). \tag{39a}$$

Since $\mathrm{Cov}(Z)^{-1} = \mathbf{J}(Z)$, this alternative identity also generalizes (33a) (with $X$ and $Z$ interchanged). For white $Z$, it reduces to

$$\frac{d}{dt} H(\sqrt{t}\, X + Z) = \frac{1}{2\sigma_Z^2}\, \mathrm{Var}(X \mid \sqrt{t}\, X + Z). \tag{39b}$$

The latter two identities were thoroughly investigated by Guo, Shamai and Verdu [10]. The above proof, via de Bruijn's identity and Kullback's expansion (36), is shorter than the proofs given in [10], and also has an intuitive interpretation, as shown next.
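
The identity (38b) can be checked numerically in the simplest setting. The sketch below assumes scalar Gaussian $X$ and $Z$ (our own simplifying assumption, made so that entropy and Fisher information have closed forms) and compares a central finite difference of $H(X + \sqrt{t} Z)$ with $\frac{1}{2}\sigma_Z^2 J(X + \sqrt{t} Z)$. All function and parameter names are hypothetical.

```python
import math

def entropy_gaussian(var):
    """Differential entropy (nats) of a scalar Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def fisher_gaussian(var):
    """Fisher information of a scalar Gaussian with variance var."""
    return 1.0 / var

def de_bruijn_gap(var_x, t, var_z=1.0, h=1e-6):
    """Finite-difference d/dt H(X + sqrt(t) Z) minus (1/2) var_z J(X + sqrt(t) Z)."""
    def H(s):
        # X + sqrt(s) Z is Gaussian with variance var_x + s * var_z
        return entropy_gaussian(var_x + s * var_z)
    derivative = (H(t + h) - H(t - h)) / (2 * h)
    return derivative - 0.5 * var_z * fisher_gaussian(var_x + t * var_z)

for t in (0.0, 0.5, 5.0):
    print(f"t={t:g}  gap={de_bruijn_gap(var_x=3.0, t=t, var_z=2.0):.2e}")
```

The printed gaps should be zero up to finite-difference error; this only illustrates the identity in the Gaussian case and says nothing about general $X$.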

4) Intuitive Interpretations: Expansions (34) and (36) can be given similar interpretations. In (36), $D_X(p_\theta \| p_{\theta'})$ has local parabolic behavior at vertex $\theta = \theta'$ with curvature $J_\theta(X)$, which means that for a given (small) value of divergence, $\theta$ is known all the more precisely as Fisher information $J_\theta(X)$ is large (see Fig. 1). This confirms that $J_\theta(X)$ is a quantity of "information" about $\theta$. Similarly, (34) shows that the mutual information $I(X + \theta Z; Z)$ between the noisy version $X + \theta Z$ of $X$ and the noise $Z$, seen as a function of the noise amplitude, is locally parabolic about $\theta = 0$ with curvature $J(X)$. Hence for a given (small) value of noise amplitude $\theta$, the noisy variable is all the more dependent on the noise as $J(X)$ is higher (see Fig. 2). Therefore, de Bruijn's identity merely states that Fisher information measures the sensitivity to an arbitrary additive independent noise, in the sense that a highly "sensitive" variable, perturbed by a small additive noise, becomes rapidly noise-dependent as the amplitude of the noise increases. This measure of sensitivity of $X$ depends on the noise covariances but is independent of the shape of the noise distribution otherwise, due to the fact that de Bruijn's identity remains true for non-Gaussian $Z$. Also, by the Cramer-Rao inequality (13), a Gaussian variable $X^*$ has lowest sensitivity to an arbitrary additive noise $Z$. Thus the saddlepoint property of mutual information $I(X + Z; Z) \ge I(X^* + Z; Z)$, classically established for Gaussian $Z$ [9], [58], [59] (see also Proposition 7 below), is seen to hold to the first order of $\sigma_Z^2$ for an arbitrary additive noise $Z$.

Fig. 1. Kullback-Leibler divergence drawn as a function of the estimated parameter for (a) low and (b) high value of Fisher information.

Fig. 2. Mutual information between a noisy variable and the noise, drawn as a function of noise amplitude $\theta$ for (a) low and (b) high value of the variable's Fisher information.

A dual interpretation is obtained by exchanging the roles of $X$ and $Z$ in (33a) or (34) to obtain an asymptotic formula for the input-output mutual information $I(X; \sqrt{t}\, X + Z)$ in a (non-Gaussian) additive noise channel $X \mapsto \sqrt{t}\, X + Z$ for small signal-to-noise ratio (SNR). In particular, for i.i.d. input entries or if the channel is memoryless, either $\mathrm{Cov}(X)$ or $\mathbf{J}(Z)$ is proportional to the identity matrix and, therefore,

$$I(X; \sqrt{t}\, X + Z) = \frac{1}{2} J(Z)\, \sigma_X^2\, t + o(t). \tag{40}$$

Thus, as has been observed in, e.g., [10], [60], [61], the rate of increase of mutual information per unit SNR is equal to $\frac{1}{2} J(Z)$ in the vicinity of zero SNR, regardless of the shape of the input distribution (see Fig. 3). In the case
of a memoryless channel, it is also insensitive to input memory, since in this case (40) still holds for correlated 
inputs. Again by the Cramer-Rao inequality (13), the Gaussian channel exhibits a minimal rate of increase of 
mutual information, which complies with the well-known fact that non-Gaussian additive noise channels cannot 
have smaller capacity than that of the Gaussian channel. 
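
To make the low-SNR slope in (40) concrete, the following sketch numerically evaluates the Fisher information $J(Z) = \int S(z)^2 p(z)\, dz$ of two unit-variance scalar noises, Gaussian and Laplacian. The midpoint-rule integrator, the integration range, and all names are our own assumptions for illustration; the expected values $J(Z) = 1$ and $J(Z) = 2$ are those quoted in the caption of Fig. 3.

```python
import math

def fisher_information(pdf, score, lo=-30.0, hi=30.0, n=100_000):
    """Midpoint-rule approximation of J = integral of score(x)^2 * pdf(x) dx.

    n is even so that no midpoint falls exactly on the Laplacian kink at 0.
    """
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        total += score(x) ** 2 * pdf(x) * dx
    return total

# Unit-variance Gaussian noise: score(x) = -x.
gauss_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
gauss_score = lambda x: -x

# Unit-variance Laplacian noise: scale b = 1/sqrt(2), score(x) = -sign(x)/b.
b = 1.0 / math.sqrt(2.0)
laplace_pdf = lambda x: math.exp(-abs(x) / b) / (2 * b)
laplace_score = lambda x: -math.copysign(1.0, x) / b

j_gauss = fisher_information(gauss_pdf, gauss_score)
j_laplace = fisher_information(laplace_pdf, laplace_score)

sigma_x2 = 1.0  # input variance, an arbitrary illustrative value
print("slope at zero SNR, Gaussian noise :", 0.5 * j_gauss * sigma_x2)
print("slope at zero SNR, Laplacian noise:", 0.5 * j_laplace * sigma_x2)
```

Under these assumptions the Laplacian channel shows twice the initial slope of the Gaussian channel, in line with the remark that the Gaussian channel has the minimal rate of increase.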

5) Applications: Apart from its role in proving the EPI, de Bruijn's identity (Proposition 6) has found many applications in the literature, although they were not always recognized as such. The Taylor expansion for non-Gaussianness corresponding to (34) in the scalar case ($n = 1$) is mentioned, albeit in a disguised form, by Linnik [62] who used it to prove the central limit theorem. Itoh [63] used Linnik's expansion to characterize the Gaussian distribution by rotation. Similar expansions have been derived by Prelov and others (see, e.g., [56], [64]-[75]) to investigate the behavior of the capacity or mutual information in additive Gaussian or non-Gaussian noise channels under various asymptotic scenarios. In particular, (40) was apparently first stated explicitly by Pinsker, Prelov and van der Meulen [56]. A similar result was previously published by Verdu [76] (see also [77]) who used Kullback's expansion (36) to lower bound the capacity per unit SNR for non-Gaussian memoryless additive noise channels, a result which is also an easy consequence of (40). Motivated by the blind source separation problem, Pham [78] (see also [79], [80]) investigated the first and second-order expansions in $\theta$ of entropy for non-Gaussian perturbation $Z$ (not necessarily independent of $X$) and recovered de Bruijn's identity as a special case. Similar first and second-order expansions for mutual information in non-Gaussian additive noise channels were derived by Guo, Shamai and Verdu [61], yielding (40) as a special case.

Fig. 3. Input-output mutual information $I(X; \sqrt{t}\, X + Z)$ over an additive noise channel, drawn as a function of SNR $t$ for small SNR and standard $Z$. (a) Gaussian channel $J(Z) = 1$. (b) Laplacian channel $J(Z) = 2$.

6) Generalized De Bruijn's Identity: Palomar and Verdu [55] proposed a matrix version of de Bruijn's identity by considering the gradient of $H(X + Z)$ with respect to the noise covariance matrix $\mathrm{Cov}(Z)$ for Gaussian $Z$. We point out that this is a simple consequence of (33a); the generalization to non-Gaussian $Z$ is as follows.

Corollary 1:

$$\frac{d}{d\mathbf{K}} H(X + Z)\Big|_{\mathbf{K}=0} = \frac{1}{2} \mathbf{J}(X), \tag{41}$$

where we have noted $\mathbf{K} = \mathrm{Cov}(Z)$.

Proof$^7$: By (33a), we have the following expansion:

$$H(X + Z) = H(X) + \frac{1}{2} \operatorname{tr}\bigl(\mathbf{J}(X)\, \mathbf{K}\bigr) + o(\|\mathbf{K}\|)$$

where $\|\mathbf{K}\|$ denotes the Frobenius norm of $\mathbf{K} = \mathbf{K}^t$. But this is of the form of a first-order Taylor expansion of a function with respect to a matrix$^8$:

$$f(\mathbf{K}) = f(0) + \operatorname{tr}\Bigl(\frac{df}{d\mathbf{K}}(0) \cdot \mathbf{K}^t\Bigr) + o(\|\mathbf{K}\|),$$

and (41) follows by identifying the gradient matrix. □

7 The 1/2 factor is absent in [55], due to the fact that complex gradients are considered.

8 Putting the matrix entries into a column vector $k$ it is easily found that $\operatorname{tr}\bigl(\frac{df}{d\mathbf{K}}(0) \cdot \mathbf{K}^t\bigr) = k^t \frac{df}{dk}(0)$.
7) Relationship between the Cramer-Rao Inequality and a Saddlepoint Property of Mutual Information: The following saddlepoint property of mutual information, which was proved in [81] using a result of Pinsker [82], states that the worst possible noise distribution in an additive noise channel is the Gaussian distribution.

Proposition 7: Let $X$ be any random vector, and let $X^*$ be a Gaussian random vector with identical second moments. For any Gaussian random vector $Z$ independent of $X$ and $X^*$,

$$I(X + Z; Z) \ge I(X^* + Z; Z). \tag{42}$$

Proof (following [59]): Noting that $Y^* = X^* + Z$ has identical second moments as $Y = X + Z$, we have $I(X + Z; Z) - I(X^* + Z; Z) = H(Y) - H(X) - H(Y^*) + H(X^*) = D(X \| X^*) - D(Y \| Y^*)$. The result follows by the data processing inequality for divergence, applied to the transformation $X \mapsto Y = X + Z$. □

This proof, in contrast to that given in [9], [58] for scalar variables, does not require the EPI, and is through a much less involved argument.

Interestingly, by virtue of de Bruijn's identity, it can be shown that (42) is equivalent to the famous Cramer-Rao inequality$^9$

$$\mathbf{J}(X) \ge \mathbf{J}(X^*) = \mathrm{Cov}(X)^{-1}. \tag{43}$$

To see this, divide both sides of (42) by the entries of $\mathrm{Cov}(Z)$ and let $\mathrm{Cov}(Z) \to 0$. By Corollary 1, this gives $\frac{1}{2}\mathbf{J}(X) \ge \frac{1}{2}\mathbf{J}(X^*)$. Conversely, integrating the relation $\frac{1}{2} \operatorname{tr}(\mathbf{J}(X + Z)\, \mathrm{Cov}(Z)) \ge \frac{1}{2} \operatorname{tr}(\mathbf{J}(X^* + Z)\, \mathrm{Cov}(Z))$ using de Bruijn's identity (38) readily gives (42).

E. Earlier Proofs of the EPI

All available information theoretic proofs of the EPI use de Bruijn's identity to integrate the FII (or the corresponding inequality for MMSE) over the path of a continuous Gaussian perturbation. To simplify the presentation, we first consider a path of the form $\{X + \sqrt{t}\, Z\}_{t \in (0, +\infty)}$ where $Z$ is assumed standard Gaussian. The derivations in this section are readily extended to the case where $Z$ is arbitrary Gaussian, by means of the corresponding generalized FII and de Bruijn's identity.

1) Basic Proof: The following is a simplified version of Stam's proof [2]. Apply the FII (14c) to the random vectors $(X_i + \sqrt{t}\, Z_i)_i$, where the $(Z_i)_i$ are independent and standard Gaussian. This gives $J(\sum_i a_i X_i + \sqrt{t}\, Z) - \sum_i a_i^2 J(X_i + \sqrt{t}\, Z_i) \le 0$, where $Z = \sum_i a_i Z_i$ is also standard Gaussian. By de Bruijn's identity (38b), it follows that $f(t) = H(\sum_i a_i X_i + \sqrt{t}\, Z) - \sum_i a_i^2 H(X_i + \sqrt{t}\, Z_i)$ is a nonincreasing function of $t$. But $f(t) = H(t^{-1/2} \sum_i a_i X_i + Z) - \sum_i a_i^2 H(t^{-1/2} X_i + Z_i)$ tends to $H(Z) - \sum_i a_i^2 H(Z_i) = 0$ as $t \to \infty$ (see Lemma 2 below). Therefore, $f(0) \ge f(\infty) = 0$, which is the EPI (5c).
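
The behavior of $f(t)$ in the basic proof can be visualized in a toy case. The sketch below assumes independent scalar Gaussian $X_i$ (our own simplifying assumption, which makes every entropy explicit) and evaluates $f$ on a grid of $t$; the coefficient and variance values are arbitrary illustrative choices, and the names are hypothetical.

```python
import math

def entropy_gaussian(var):
    """Differential entropy (nats) of a scalar Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def f(t, a, variances):
    """f(t) = H(sum_i a_i X_i + sqrt(t) Z) - sum_i a_i^2 H(X_i + sqrt(t) Z_i).

    Scalar Gaussian case with standard Gaussian Z and Z_i: the variance of
    sum_i a_i X_i + sqrt(t) Z is sum_i a_i^2 v_i + t when sum_i a_i^2 = 1.
    """
    mixed = sum(ai ** 2 * vi for ai, vi in zip(a, variances))
    return entropy_gaussian(mixed + t) - sum(
        ai ** 2 * entropy_gaussian(vi + t) for ai, vi in zip(a, variances)
    )

a = [0.6, 0.8]           # normalized: 0.36 + 0.64 = 1
variances = [0.5, 4.0]   # unequal variances, so the EPI is strict at t = 0
grid = [0.0, 0.1, 1.0, 10.0, 100.0, 1e4]
values = [f(t, a, variances) for t in grid]
for t, v in zip(grid, values):
    print(f"t={t:<8g} f(t)={v:.3e}")
```

In this example $f$ starts at a positive value, decreases along the grid, and approaches zero for large $t$, which mirrors the argument $f(0) \ge f(\infty) = 0$.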

Note that the case of equality in (5c) is easily determined by this approach, since it reduces to the case of equality in the corresponding FII (see Section II-C.4). Namely, equality holds in the EPI (5c) iff all random vectors $X_i$ for which $a_i \ne 0$ are Gaussian with identical covariances. It follows that equality holds in the classical form of the EPI (5a) iff all random vectors $X_i$ for which $a_i \ne 0$ are Gaussian with proportional covariances.

9 This follows from the relation $\mathbf{J}(X) - \mathbf{J}(X^*) = \mathrm{Cov}(S(X) - S^*(X)) \ge 0$, where $S^*(X)$ is defined as in (12).

2) Integral Representations of Differential Entropy: In the above proof, de Bruijn's identity can be rewritten as an integral representation of entropy. To see this, introduce an auxiliary Gaussian random vector $X^*$, and rewrite de Bruijn's identity (38b) in the form$^{10}$ $\frac{d}{dt}\bigl(H(X^* + \sqrt{t}\, Z) - H(X + \sqrt{t}\, Z)\bigr) = -\frac{1}{2}\bigl(J(X + \sqrt{t}\, Z) - J(X^* + \sqrt{t}\, Z)\bigr)$. Since $H(X^* + \sqrt{t}\, Z) - H(X + \sqrt{t}\, Z) \to 0$ as $t \to \infty$, we may integrate from $t = 0$ to $+\infty$ to obtain $H(X) - H(X^*)$ as $-\frac{1}{2}$ times the integral of $J(X + \sqrt{t}\, Z) - J(X^* + \sqrt{t}\, Z)$. If, for example, $X^*$ is chosen standard, one obtains the integral representation [46]

$$H(X) = \frac{n}{2} \log(2\pi e) - \frac{1}{2} \int_0^\infty \Bigl( J(X + \sqrt{t}\, Z) - \frac{n}{1+t} \Bigr)\, dt. \tag{44a}$$

In view of this identity, the EPI (5c) immediately follows from the corresponding FII (14c).
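
As a sanity check on the integral representation (44a), the following sketch evaluates it numerically for a scalar Gaussian $X$ of variance $v$ (so $n = 1$ and $J(X + \sqrt{t} Z) = 1/(v + t)$) and compares the result with the known entropy $\frac{1}{2}\log(2\pi e v)$. The Gaussian test case, the substitution $t = u/(1-u)$ used to handle the infinite range, and the function names are our own assumptions.

```python
import math

def entropy_via_44a(var_x, n_steps=200_000):
    """Evaluate the right-hand side of (44a) for scalar Gaussian X with variance var_x."""
    total = 0.0
    du = 1.0 / n_steps
    for k in range(n_steps):
        u = (k + 0.5) * du            # midpoint rule on (0, 1)
        t = u / (1.0 - u)             # maps (0, 1) onto (0, +inf)
        dt = du / (1.0 - u) ** 2      # Jacobian of the substitution
        total += (1.0 / (var_x + t) - 1.0 / (1.0 + t)) * dt
    return 0.5 * math.log(2 * math.pi * math.e) - 0.5 * total

for v in (0.25, 1.0, 3.0):
    exact = 0.5 * math.log(2 * math.pi * math.e * v)
    print(f"var={v:g}  (44a)={entropy_via_44a(v):.6f}  exact={exact:.6f}")
```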

3) Other Paths of Integration: Several variants of the above proof were published, either in differential or integral form. Dembo, Cover and Thomas [4] and Carlen and Soffer [5] use a path connecting $Z$ to $X$ of the form $\{\sqrt{t}\, X + \sqrt{1-t}\, Z\}_{t \in (0,1)}$. The argument leading to the EPI is the same up to an appropriate change of variable. The corresponding integral representation

$$H(X) = \frac{n}{2} \log(2\pi e) - \frac{1}{2} \int_0^1 \bigl( J(\sqrt{t}\, X + \sqrt{1-t}\, Z) - n \bigr)\, \frac{dt}{t} \tag{44b}$$

was first used by Barron [30] to prove a strong version of the central limit theorem. Verdu and Guo [11] used the path $\{\sqrt{t}\, X + Z\}_{t \in (0, +\infty)}$ and replaced Fisher information by MMSE. They used (39b) to integrate inequality (30) over this path. Their proof is completely equivalent to Stam's proof above, by means of the complementary relation (26) of Proposition 5 and the change of variable $t' = 1/t$. The corresponding integral representation becomes [10]-[12]

$$H(X) = \frac{n}{2} \log(2\pi e) - \frac{1}{2} \int_0^\infty \Bigl( \frac{n}{1+t} - \mathrm{Var}(X \mid \sqrt{t}\, X + Z) \Bigr)\, dt. \tag{44c}$$

Yet another possibility is to take the path $\{\sqrt{1-t}\, X + \sqrt{t}\, Z\}_{t \in (0,1)}$ connecting $X$ to $Z$, leading to the following integral representation:

$$H(X) = \frac{n}{2} \log(2\pi e) - \frac{1}{2} \int_0^1 \Bigl( n - \frac{1}{t}\, \mathrm{Var}(X \mid \sqrt{1-t}\, X + \sqrt{t}\, Z) \Bigr)\, \frac{dt}{t}. \tag{44d}$$

All the above representations for entropy are equivalent through appropriate changes of variable inside the integrals. 

III. A New Proof of Shannon's EPI 

A. A Mutual Information Inequality (MII)

From the analysis made in Section II, it is clear that earlier information theoretic proofs of the EPI can be seen as variants of the same proof, with the following common ingredients:

1) a data processing inequality applied to the linear transformation (6).

2) an integration over a path of a continuous Gaussian perturbation.

While step 1) uses the data processing theorem in terms of either parametric Fisher information or MMSE, step 2) uses de Bruijn's identity, which relates Fisher information or MMSE to entropy or mutual information. This suggests that it should be possible to prove the EPI via a data processing argument made directly on the mutual information. The interest is two-fold: First, compared to the data processing theorem for Fisher information, the corresponding theorem for Shannon's mutual information is presumably more familiar to the readers of this journal. Second, this approach sidesteps both Fisher information and MMSE and avoids the use of de Bruijn's identity (38b) or (39b).

10 When $X^*$ is chosen such that $\mathrm{Cov}(X^*) = \mathrm{Cov}(X)$, the identity relates nonnegative "non-Gaussiannesses" (3) and (12).

We shall prove a stronger statement than the EPI, namely, that the difference between both sides of (5c) decreases 
as independent Gaussian noise Z is added. Since H(X + Z) — H(X) = I(X + Z; Z) for any X independent of 
Z, we write this statement in terms of mutual information as follows. 

Theorem 1 (Mutual Information Inequality (MII)): For any finitely many independent random n-vectors $(X_i)_i$, any real-valued coefficients $(a_i)_i$ normalized such that $\sum_i a_i^2 = 1$, and any Gaussian n-vector $Z$ independent of $(X_i)_i$,

$$I\Bigl(\sum_i a_i X_i + Z; Z\Bigr) \le \sum_i a_i^2\, I(X_i + Z; Z). \tag{45}$$

Furthermore, this inequality implies the EPI.

The MII (45) can be interpreted as a convexity property of mutual information under the covariance-preserving
transformation (6). As we shall see, the crucial step in the proof of Theorem 1 is the data processing inequality for 
mutual information [9]. We also need the following technical lemmas. 

Lemma 1 (Sato's Inequality): If the random vectors $(X_i)_i$ are independent of $Z$ and of each other, then

$$I\bigl((X_i + Z)_i; Z\bigr) \le \sum_i I(X_i + Z; Z). \tag{46}$$

Proof: Let $Y_i = X_i + Z$ for all $i$, $Y = (Y_i)_i$ and define the symmetric mutual information between the components of $Y$ by the divergence

$$I\{(Y_i)_i\} = E \log \frac{p(Y)}{\prod_i p(Y_i)} = \sum_i I(Y_i; Y_1, \ldots, Y_{i-1}). \tag{47}$$

From the definitions it is obvious that $I((Y_i)_i; Z) - \sum_i I(Y_i; Z) = I\{(Y_i)_i \mid Z\} - I\{(Y_i)_i\}$. The result follows since $I\{(Y_i)_i\} \ge 0$ and $I\{(Y_i)_i \mid Z\} = I\{(X_i)_i\} = 0$. □

Inequality (46) was proved for two variables by Sato [83] who used it to derive an outer bound to the capacity 
region of broadcast channels. A similar inequality appears in [84, Thm. 4.2.1] and in [85, Thm 1.9]. 
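
Since Lemma 1 does not require densities, it can be checked by brute-force enumeration on small discrete alphabets. The sketch below does this for independent integer-valued $X_i$ and $Z$; the probability mass functions are arbitrary illustrative values and the helper names are hypothetical.

```python
import itertools
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A; B) in nats from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def sato_terms(pmf_xs, pmf_z):
    """Return (I((X_i + Z)_i; Z), sum_i I(X_i + Z; Z)) for independent discrete X_i and Z."""
    joint_vec = defaultdict(float)                      # pmf of ((Y_1, ..., Y_k), Z)
    joint_each = [defaultdict(float) for _ in pmf_xs]   # pmf of (Y_i, Z) for each i
    for z, pz in pmf_z.items():
        for xs in itertools.product(*(pmf.items() for pmf in pmf_xs)):
            p = pz * math.prod(px for _, px in xs)
            ys = tuple(x + z for x, _ in xs)
            joint_vec[(ys, z)] += p
            for i, y in enumerate(ys):
                joint_each[i][(y, z)] += p
    return mutual_information(joint_vec), sum(mutual_information(j) for j in joint_each)

pmf_xs = [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}]   # illustrative pmfs of X_1, X_2
pmf_z = {0: 0.6, 1: 0.4}                        # illustrative pmf of Z
lhs, rhs = sato_terms(pmf_xs, pmf_z)
print(f"I((X_i+Z)_i; Z) = {lhs:.4f} <= sum_i I(X_i+Z; Z) = {rhs:.4f}")
```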

Lemma 2: If $X$ and $Z$ are independent random n-vectors and $\lim_{\varepsilon \to 0} X + \varepsilon Z = X$ in distribution, then

$$\lim_{\varepsilon \to 0} I(X + \varepsilon Z; Z) = 0. \tag{48}$$

If, in addition, $I(X + \varepsilon Z; Z)$ is twice differentiable in a neighborhood of $\varepsilon = 0$, then $I(X + \varepsilon Z; Z) = O(\varepsilon^2)$ and, therefore, for any $a \in \mathbb{R}$,

$$I(X + a\varepsilon Z; Z) = a^2\, I(X + \varepsilon Z; Z) + o(\varepsilon^2). \tag{49}$$

Proof: To prove (48) we invoke the lower semicontinuity of divergence (3) as in [30]: $\liminf_{\varepsilon \to 0} D(X + \varepsilon Z \| X^* + \varepsilon Z^*) \ge D(X \| X^*)$, where the Gaussian random vectors $X^*$ and $Z^*$ have identical covariances as $X$ and $Z$, respectively. Since $\lim_{\varepsilon \to 0} H(X^* + \varepsilon Z^*) = H(X^*)$, this inequality reduces to $\limsup_{\varepsilon \to 0} I(X + \varepsilon Z; Z) \le 0$. This combined with nonnegativity of mutual information proves (48)$^{11}$.

Let $g(\varepsilon) = I(X + \varepsilon Z; Z)$ and assume $g$ is twice differentiable about the origin. We have $g(0) = 0$ since $X$ and $Z$ are independent, and $g'(0) = 0$ since mutual information is nonnegative. Now (49) follows from the second-order Taylor expansions $g(\varepsilon) = \frac{\varepsilon^2}{2} g''(0) + o(\varepsilon^2)$ and $g(a\varepsilon) = a^2 \frac{\varepsilon^2}{2} g''(0) + o(\varepsilon^2)$. □

Note that neither Lemma 1 nor Lemma 2 requires $Z$ to be Gaussian.
Proof of Theorem 1: We can write the following string of inequalities:

$$\begin{aligned}
I\Bigl(\sum_i a_i X_i + Z; Z\Bigr) &= I\Bigl(\sum_i a_i (X_i + a_i Z); Z\Bigr) && \text{(50a)} \\
&\le I\bigl((X_i + a_i Z)_i; Z\bigr) && \text{(50b)} \\
&\le \sum_i I(X_i + a_i Z; Z) && \text{(50c)} \\
&= \sum_i a_i^2\, I(X_i + Z; Z) + o(\sigma_Z^2), && \text{(50d)}
\end{aligned}$$

where (50a) holds since $\sum_i a_i^2 = 1$, (50b) follows from the data processing theorem applied to the linear transformation (6), (50c) follows from Sato's inequality (Lemma 1), and (50d) follows by applying (49) to $\varepsilon = \sigma_Z$, assuming that $Z$ and $X_i$ satisfy the differentiability assumption of Lemma 2 for all $i$.

We now use the assumption that $Z$ is Gaussian to eliminate the $o(\sigma_Z^2)$ term in (50d). Start with any independent random n-vectors $(X_i)_i$ and consider the function

$$f(t) = I\Bigl(\sum_i a_i X_i + \sqrt{t}\, Z; Z\Bigr) - \sum_i a_i^2\, I(X_i + \sqrt{t}\, Z; Z).$$

Let $(Z'_i)_i$ be Gaussian, identically distributed as $Z$ but independent of all other random vectors, and let $X'_i = X_i + \sqrt{t}\, Z'_i$ for all $i$ and $t > 0$. This perturbation ensures that densities are smooth, so that the $(X_i + \sqrt{t}\, Z'_i)_i$ satisfy the differentiability assumption of Lemma 2. We may, therefore, apply (50d) to them, which gives

$$I\Bigl(\sum_i a_i X_i + \sqrt{t}\, Z' + \sqrt{\varepsilon}\, Z; Z\Bigr) \le \sum_i a_i^2\, I(X_i + \sqrt{t}\, Z'_i + \sqrt{\varepsilon}\, Z; Z) + o(\varepsilon), \tag{51}$$

where $Z' = \sum_i a_i Z'_i$ is identically distributed as $Z$. By the stability property of the Gaussian distribution under convolution, $\sqrt{t}\, Z' + \sqrt{\varepsilon}\, Z$ and the $(\sqrt{t}\, Z'_i + \sqrt{\varepsilon}\, Z)_i$ are identically distributed as $\sqrt{t + \varepsilon}\, Z$. Applying the following identity for independent $X$, $Z$, $Z'$:

$$I(X + Z' + Z; Z) = I(X + Z' + Z; Z' + Z) - I(X + Z'; Z'), \tag{52}$$

which holds since $H(X + Z' + Z) - H(X + Z') = H(X + Z' + Z) - H(X) - (H(X + Z') - H(X))$, (51) is rewritten as

$$I\Bigl(\sum_i a_i X_i + \sqrt{t + \varepsilon}\, Z; Z\Bigr) - I\Bigl(\sum_i a_i X_i + \sqrt{t}\, Z; Z\Bigr) \le \sum_i a_i^2 \Bigl( I(X_i + \sqrt{t + \varepsilon}\, Z; Z) - I(X_i + \sqrt{t}\, Z; Z) \Bigr) + o(\varepsilon),$$

that is, $f(t + \varepsilon) \le f(t) + o(\varepsilon)$. It follows that $f$ is nonincreasing in $t > 0$. Also, by Lemma 2, $\lim_{\varepsilon \to 0} f(\varepsilon) = f(0) = 0$. Therefore, $f(1) \le f(0) = 0$, which is the required MII (45).

11 This proof simplifies when $X$ is Gaussian, since then $I(X + \varepsilon Z; Z) \le H(X^* + \varepsilon Z^*) - H(X) \to H(X^*) - H(X) = 0$ as $\varepsilon \to 0$.

Finally, we show that the MII implies the EPI (5c). Since $\sum_i a_i^2 = 1$ and $I(X + Z; Z) = H(X + Z) - H(X) = I(X; X + Z) + H(Z) - H(X)$ for any $X$ independent of $Z$, the MII can be rewritten as

$$H\Bigl(\sum_i a_i X_i\Bigr) - \sum_i a_i^2 H(X_i) \ge I\Bigl(\sum_i a_i X_i; \sum_i a_i X_i + Z\Bigr) - \sum_i a_i^2\, I(X_i; X_i + Z). \tag{53}$$

Now replace $Z$ by $\sqrt{t}\, Z$ and let $t \to \infty$. The terms in the right-hand side of the above inequality are of the form $I(X; X + \sqrt{t}\, Z) = I(X; t^{-1/2} X + Z)$, which tends to zero as $t \to \infty$ by Lemma 2. This completes the proof. □
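
The MII (45) can be spot-checked in the one case where every term has a closed form. The sketch below assumes independent scalar Gaussian $X_i$ and scalar Gaussian $Z$ (our own simplifying assumption), for which $I(X + Z; Z) = \frac{1}{2}\log(1 + \sigma_Z^2/\sigma_X^2)$, and samples random normalized coefficients and variances. It is only a numerical illustration, not a substitute for the proof; names and sampling ranges are hypothetical.

```python
import math
import random

def mii_sides(a, variances, noise_var):
    """Both sides of (45) for independent scalar Gaussian X_i and Gaussian Z.

    For Gaussian X of variance v and Z of variance s, I(X + Z; Z) = 0.5 * log(1 + s / v).
    """
    def mi(v):
        return 0.5 * math.log(1.0 + noise_var / v)
    mixed = sum(ai ** 2 * vi for ai, vi in zip(a, variances))  # variance of sum_i a_i X_i
    lhs = mi(mixed)
    rhs = sum(ai ** 2 * mi(vi) for ai, vi in zip(a, variances))
    return lhs, rhs

random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    k = random.randint(2, 5)
    raw = [random.gauss(0, 1) for _ in range(k)]
    norm = math.sqrt(sum(r * r for r in raw))
    a = [r / norm for r in raw]                      # enforce sum_i a_i^2 = 1
    variances = [random.uniform(0.1, 10.0) for _ in range(k)]
    lhs, rhs = mii_sides(a, variances, noise_var=random.uniform(0.1, 10.0))
    worst_gap = min(worst_gap, rhs - lhs)
print("smallest observed (right side - left side):", worst_gap)
```

In this Gaussian setting the inequality reduces to Jensen's inequality for the convex map $v \mapsto \log(1 + s/v)$, so the smallest observed gap should be nonnegative up to rounding.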

B. Insights and Discussions 

1) Relationship to Earlier Proofs: Of course, Theorem 1 could also be proved using the conventional techniques of Section II. In fact, it follows easily from either one of the integral representations (44). Also Lemma 2 is an easy consequence of de Bruijn's identity, since by (34), both sides of (49) are equal to $\frac{\varepsilon^2}{2} \operatorname{tr}(\mathrm{Cov}(aZ)\, \mathbf{J}(X)) = \frac{a^2 \varepsilon^2}{2} \operatorname{tr}(\mathrm{Cov}(Z)\, \mathbf{J}(X))$ up to $o(\varepsilon^2)$. The originality here lies in the above proof of Theorem 1 and the EPI, which in contrast to existing proofs, requires neither de Bruijn's identity nor the notions of Fisher information or MMSE.

The new proof shares common ingredients with earlier proofs of the EPI, namely items 1) and 2) listed at 
the beginning of this section. The difference is that they are used directly in terms of mutual information. As in 
section II-E.3, other paths of continuous Gaussian perturbation could very well be used, through suitable changes 
of variable. 

One may wonder if mutual informations in the form $I(X; \sqrt{t}\, X + Z)$ rather than $I(X + \sqrt{t}\, Z; Z)$ could be used in the above derivation of Theorem 1, particularly in inequalities (50). This would offer a dual proof, in the same way that Verdu and Guo's proof is dual to Stam and Blachman's original proof of the EPI, as explained in Section II. But a closer look at the above proof reveals that the dual approach would amount to proving (53), whose natural proof using the data processing inequality is through (50). Thus, it turns out that the two approaches amount to the same.

Also note that by application of de Bruijn's identity, inequality (50d) reduces to the FII (14c). Thus the MII (45) implies both the EPI (5c) and the FII (14c).

2) The Equality Case: Our method does not easily settle the case of equality in the MII. By the preceding remark, however, equality in (50d) implies equality in the FII (14c), which was determined in Section II-C.4. It follows that equality holds in the MII (45) if and only if all random vectors $X_i$ such that $a_i \ne 0$ are Gaussian with identical covariances. This result implies the corresponding necessity condition of equality in the EPI, but is not evident from the properties of mutual information alone.

April 1, 2007 DRAFT

3) On the Gaussianness of Z: It is interesting to note that from (50d), the MII holds up to first order of $\sigma_Z^2$, regardless of whether $Z$ is Gaussian or not. However, the stability property of the Gaussian distribution under convolution was crucial in the next step of the proof, because the Gaussian perturbation $Z$ can be made to affect the random vectors independently. In fact, the MII can be easily rewritten as

$$H\Bigl(\sum_i a_i X_i\Bigr) - \sum_i a_i^2 H(X_i) \ge H\Bigl(\sum_i a_i X'_i\Bigr) - \sum_i a_i^2 H(X'_i) \tag{54}$$

where $X'_i = X_i + Z_i$ for all $i$, the $(Z_i)_i$ being independent copies of $Z$. This does not hold in general for non-Gaussian random vectors $(Z_i)_i$. To see this, choose $(X_i)_i$ themselves Gaussian with identical covariances. Then the left-hand side of (54) is zero, and by the necessity of the condition for equality in the EPI, the right-hand side is positive, as soon as $Z_i$ is non-Gaussian for some $i$ such that $a_i \ne 0$. Therefore, in this case, the opposite inequality is obtained. In other words, adding non-Gaussian noise may increase the difference between both sides of the EPI (5c), in accordance with the fact that this difference is zero for Gaussian random vectors.

4) On the Use of Sato's Inequality: Sato used (46) and the data processing inequality to derive his cooperative 
outer bound to the capacity region of two-user broadcast channels [83]. This bound was used to determine the 
capacity of a two-user MIMO Gaussian broadcast channel [86]. Sato's bound was later replaced by the EPI to 
generalize Bergmans' solution to an arbitrary multi-user MIMO Gaussian broadcast channel using the notion of 
an "enhanced" channel [15]. In the present paper, the EPI itself is proved using Sato's inequality and the data 
processing inequality. This suggests that for proving converse coding theorems, a direct use of the EPI may be 
avoided by suitable inequalities for mutual information. A similar remark goes for the generalization of Ozarow's 
solution to vector Gaussian multiple descriptions [87]. 

5) Relationship Between Various Data Processing Theorems: Proposition 4 sheds light on the connection between two estimation theoretic data processing inequalities: parametric (Fisher information) and nonparametric (MMSE). While these were applied in earlier proofs of the EPI, the new proof uses the same data processing argument in terms of mutual information: any transformation $X \to Y$ in a Markov chain $\theta \to X \to Y$ reduces information about $\theta$. This can also be given a parametric form using divergence (35). Thus, if $\theta \to X \to Y$ form a Markov chain, then

$$I(\theta; Y) \le I(\theta; X) \tag{55a}$$

$$D_Y(p_\theta \| p_{\theta'}) \le D_X(p_\theta \| p_{\theta'}). \tag{55b}$$

As in Proposition 4, the first data processing inequality involves a random variable $\theta$, while the second considers $\theta$ as a parameter. The proof is immediate from the chain rules $I(\theta; Y) + I(\theta; X|Y) = I(\theta; X, Y) = I(\theta; X)$ and $D_Y(p_\theta \| p_{\theta'}) + D_{X|Y}(p_\theta \| p_{\theta'}) = D_{X,Y}(p_\theta \| p_{\theta'}) = D_X(p_\theta \| p_{\theta'})$ where by the Markov chain condition, $I(\theta; Y|X) = 0$ and $D_{Y|X}(p_\theta \| p_{\theta'}) = 0$, respectively.
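
The data processing inequality (55a) is easy to verify by enumeration on finite alphabets. The sketch below builds a Markov chain $\theta \to X \to Y$ from a prior and two transition tables and compares $I(\theta; Y)$ with $I(\theta; X)$. The binary alphabets and the probability values are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A; B) in nats from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def markov_chain_informations(p_theta, channel_x, channel_y):
    """Return (I(theta; X), I(theta; Y)) for a discrete Markov chain theta -> X -> Y."""
    joint_tx, joint_ty = defaultdict(float), defaultdict(float)
    for t, pt in p_theta.items():
        for x, px in channel_x[t].items():
            joint_tx[(t, x)] += pt * px
            for y, py in channel_y[x].items():
                joint_ty[(t, y)] += pt * px * py   # Y depends on theta only through X
    return mutual_information(joint_tx), mutual_information(joint_ty)

p_theta = {0: 0.4, 1: 0.6}                                   # prior on theta
channel_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}       # p(x | theta)
channel_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.25, 1: 0.75}}     # p(y | x)
i_tx, i_ty = markov_chain_informations(p_theta, channel_x, channel_y)
print(f"I(theta; Y) = {i_ty:.4f} <= I(theta; X) = {i_tx:.4f}")
```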

Comparing the various proofs of the EPI presented above, it is clear that, as already suggested in Zamir's presentation [50], estimation theoretic and information theoretic data processing inequalities are strongly related. Also note that in view of (36), the lesser known data processing inequality for Fisher information (20b) is an immediate consequence of the corresponding inequality for divergence (55b). Indeed, dividing both sides of (55b) by $\|\theta - \theta'\|^2$ and letting $\theta' \to \theta$ gives (20b). It would be interesting to see if the various data processing inequalities (for mutual information, divergence, MMSE, and Fisher information) can be further unified and given a common viewpoint, leading to new insights and applications.

6) On the EPI for Discrete Variables: The above proof of the MII does not require the $(X_i)_i$ to be random vectors with densities. Therefore, it also holds when the random vectors are discrete (finitely or countably) valued. In fact, Verdu and Guo [11] used [10, Lemma 6, App. VII] to show that the EPI (5c) also holds in this case, where differential entropies are replaced by entropies. We point out that this is in fact an immediate consequence of the stronger inequality

$$H\Bigl(\sum_i a_i X_i\Bigr) \ge \max_i H(X_i)$$

for any independent discrete random vectors $(X_i)_i$ and any real-valued coefficients $(a_i)_i$, which is easily obtained by noting that $H(\sum_i a_i X_i) \ge H(\sum_i a_i X_i \mid (X_j)_{j \ne i}) = H(X_i)$ for all $i$. Note, however, that the classical EPI in the form $\exp \frac{2}{n} H(\sum_i X_i) \ge \sum_i \exp \frac{2}{n} H(X_i)$ does not hold in general for discrete random vectors; a simple counterexample is obtained by taking deterministic $X_i$ for all $i$.

IV. Zamir and Feder's EPI for Linear Transformations

A. Background

Zamir and Feder [38]-[40] generalized the scalar EPI by extending the linear combination $\sum_i a_i X_i$ of random variables to an arbitrary linear transformation $\mathbf{A}X$, where $X$ is the random vector of independent entries $(X_j)_j$ and $\mathbf{A} = (a_{i,j})_{i,j}$ is a rectangular matrix. They showed that the resulting inequality cannot be derived by a straightforward application of the vector EPI of Proposition 1. They also noted that it becomes trivial if $\mathbf{A}$ is row-rank deficient. Therefore, in the following, we assume that $\mathbf{A}$ has full row rank.

Zamir and Feder's generalized EPI (ZF-EPI) has been used to derive results on closeness to normality after linear transformation of a white random vector in the context of minimum entropy deconvolution [38] and analyze the rate-distortion performance of an entropy-coded dithered quantization scheme [88]. It was also used as a guide to extend the Brunn-Minkowski inequality in geometry [89], [90], which can be applied to the calculation of lattice quantization bit rates under spectral constraints.

The equivalent forms of the ZF-EPI corresponding to those given in Proposition 1 are the following.

Proposition 8 (Equivalent ZF-EPIs): The following inequalities, each stated for any random (column) vector $X$ of independent entries $(X_j)_j$ with densities and real-valued rectangular full row rank matrix $\mathbf{A}$, are equivalent.

$$N(\mathbf{A}X) \ge \bigl|\mathbf{A}\, \mathrm{diag}(N(X_j))_j\, \mathbf{A}^t\bigr|^{1/r}, \tag{56a}$$

$$H(\mathbf{A}X) \ge H(\mathbf{A}\tilde{X}), \tag{56b}$$

(56c)

where $r$ is the number of rows in $\mathbf{A}$, and the components of $\tilde{X} = (\tilde{X}_j)_j$ are independent Gaussian random variables of entropies $H(\tilde{X}_j) = H(X_j)$.

The proof of Proposition 8 is a direct extension of that of Proposition 1. That (56a), (56b) are equivalent
follows immediately from the equalities $|A\,\mathrm{diag}(N(X_j))_j A^t|^{1/r} = |A\,\mathrm{diag}(N(\tilde{X}_j))_j A^t|^{1/r} = |A\,\mathrm{Cov}(\tilde{X})A^t|^{1/r} =
|\mathrm{Cov}(A\tilde{X})|^{1/r} = N(A\tilde{X})$. The implication (56c) $\Rightarrow$ (56a) is proved in [40], and the equivalence (56b) $\Leftrightarrow$ (56c)
is proved in detail in [12].
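
As a quick sanity check of (56c) in the Gaussian case, the following Python sketch evaluates the gap $H(AX) - \sum_{i,j} a_{i,j}^2 H(X_j)$ for independent Gaussian entries, using the closed form $H = \frac{1}{2}\log((2\pi e)^r |\mathrm{Cov}|)$. The function name `zf_epi_gap`, the example matrix and the variances are illustrative assumptions, not taken from the paper, and the sketch is restricted to $r = 2$ rows so that the determinant can be written out by hand.

```python
import math

def zf_epi_gap(A, variances):
    """Gap H(AX) - sum_{i,j} a_ij^2 H(X_j) in nats, for independent Gaussian X_j.

    Assumes A has exactly two orthonormal rows (AA^t = I); illustrative only.
    """
    r, m = len(A), len(variances)
    assert r == 2, "this sketch only handles r = 2 rows"
    # Cov(AX) = A diag(variances) A^t, a 2x2 matrix
    c = [[sum(A[i][j] * variances[j] * A[k][j] for j in range(m))
          for k in range(r)] for i in range(r)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    h_ax = 0.5 * math.log((2 * math.pi * math.e) ** r * det)
    rhs = sum(A[i][j] ** 2 * 0.5 * math.log(2 * math.pi * math.e * variances[j])
              for i in range(r) for j in range(m))
    return h_ax - rhs

# Hypothetical example: a 2x3 matrix with orthonormal rows
s2, s6 = math.sqrt(2.0), math.sqrt(6.0)
A_EXAMPLE = [[1 / s2, 1 / s2, 0.0],
             [1 / s6, -1 / s6, 2 / s6]]

gap_unequal = zf_epi_gap(A_EXAMPLE, [1.0, 4.0, 9.0])  # strictly positive
gap_equal = zf_epi_gap(A_EXAMPLE, [3.0, 3.0, 3.0])    # zero up to rounding
```

With equal variances the gap vanishes, which matches the interpretation of (56c) as a property of the variance-preserving transformation; with unequal variances the gap is positive.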

Similarly as for (5c), inequality (56c) can be interpreted as a concavity property of entropy under the
variance-preserving$^{12}$ transformation

$$X \to AX \qquad (AA^t = I) \tag{57}$$

and is the golden door in the route of proving the ZF-EPI. The conventional techniques presented in Section II
generalize to the present situation. One has the following Fisher information matrix inequalities analogous to (56):

$$\mathbf{J}^{-1}(AX) \ge A\,\mathbf{J}^{-1}(X)\,A^t, \tag{58a}$$
$$\mathbf{J}(AX) \le \mathbf{J}(A\tilde{X}), \tag{58b}$$
$$\mathbf{J}(AX) \le A\,\mathbf{J}(X)\,A^t \qquad (AA^t = I), \tag{58c}$$

where the components of $\tilde{X} = (\tilde{X}_j)_j$ are independent Gaussian variables with Fisher informations $J(\tilde{X}_j) = J(X_j)$
for all $j$. The first inequality (58a) was derived by Papathanasiou [49] and independently by Zamir and Feder [38],
[40], who used a generalization of the conditional mean representation of score (see Section II-C.2); their proof is
simplified in [91], [92]. Later, Zamir [50] provided an insightful proof of (58) by generalizing Stam's approach (see
Section II-C.1) and also determined the case of equality [91], [93]. Taking the trace in both sides of (58c) gives

$$J(AX) \le \sum_{i,j} a_{i,j}^2\, J(X_j) \qquad (AA^t = I), \tag{59}$$

which was used by Zamir and Feder [40], [50] to prove the ZF-EPI (56c) by integration over the path
$\{\sqrt{t}\,X + \sqrt{1-t}\,Z\}$ (see Section II-E). Finally, Guo, Shamai and Verdú [12] generalized their approach (see
Section II-C.3) to obtain the inequality

$$\mathrm{Var}(AX \mid AX + Z) \ge \sum_{i,j} a_{i,j}^2\, \mathrm{Var}(X_j \mid X_j + Z_j), \tag{60}$$

where $Z$ and the $(Z_j)_j$ are standard Gaussian independent of $X$, and used it to prove the ZF-EPI (56c) by integration
over the path $\{\sqrt{t}\,X + Z\}$ (see Section II-E). Again the approaches corresponding to (59) and (60) are equivalent
by virtue of the complementary relation (26), as explained in Section II-C.3.

$^{12}$ If the $(X_j)_j$ have equal variances, then so have the components of $AX$, since $\mathrm{Cov}(X) = \sigma^2 I$ implies
$\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X)A^t = \sigma^2 AA^t = \sigma^2 I$.



April 1, 2007 



DRAFT 



25 



B. A New Proof of the ZF-EPI 

The same ideas as in the proof of Theorem 1 are easily generalized to prove the ZF-EPI.

Theorem 2 (Mutual Information Inequality for Linear Transformations): For any random vector $X$ with
independent entries $(X_j)_j$, any real-valued rectangular matrix $A$ with $r$ orthonormal rows ($AA^t = I$), and any standard
Gaussian random $r$-vector $\mathbf{Z}$ and variable $Z$ independent of $X$,

$$I(AX + \mathbf{Z}; \mathbf{Z}) \le \sum_{i,j} a_{i,j}^2\, I(X_j + Z; Z). \tag{61}$$

Furthermore, this inequality implies the ZF-EPI.

Proof: Noting $Z' = A^t\mathbf{Z}$, a Gaussian random vector with the same dimension as $X$, we can write the following
string of inequalities:

$$\begin{aligned}
I(AX + \mathbf{Z}; \mathbf{Z}) &= I(A(X + Z'); Z') && \text{(62a)} \\
&\le I(X + Z'; Z') && \text{(62b)} \\
&\le \sum_j I(X_j + Z'_j; Z'_j) && \text{(62c)}
\end{aligned}$$

where (62a) holds since $AA^t = I$, (62b) follows from the data processing theorem applied to the linear
transformation (57), and (62c) follows from Sato's inequality (Lemma 1). Now apply the resulting inequality with $X$
replaced by $X + \sqrt{t}\,\tilde{Z}$, where $t > 0$ and $\tilde{Z}$ is a standard Gaussian random vector independent of all other random
variables, and replace $\mathbf{Z}$ by $\sqrt{\epsilon}\,\mathbf{Z}$, where $\epsilon > 0$. This gives

$$I(AX + \sqrt{t}\,A\tilde{Z} + \sqrt{\epsilon}\,\mathbf{Z}; \mathbf{Z}) \le \sum_j I(X_j + \sqrt{t}\,\tilde{Z}_j + \sqrt{\epsilon}\,Z'_j; Z'_j).$$

The Gaussian perturbation $\tilde{Z}$ ensures that densities of the $(X_j + \sqrt{t}\,\tilde{Z}_j)_j$ are smooth, so that (48) of Lemma 2
applies to the right-hand side. Noting that $\mathrm{Cov}(Z') = A^tA$ and therefore $\sigma^2_{Z'_j} = \sum_i a_{i,j}^2$ for all $j$, we obtain

$$I(AX + \sqrt{t}\,A\tilde{Z} + \sqrt{\epsilon}\,\mathbf{Z}; \mathbf{Z}) \le \sum_{i,j} a_{i,j}^2\, I(X_j + \sqrt{t}\,\tilde{Z}_j + \sqrt{\epsilon}\,Z; Z) + o(\epsilon)$$

where $A\tilde{Z}$ is identically distributed as $\mathbf{Z}$ (since $AA^t = I$), and $Z$ is a standard Gaussian variable, independent of
all other random variables. By the stability property of the Gaussian distribution under convolution,
$\sqrt{t}\,A\tilde{Z} + \sqrt{\epsilon}\,\mathbf{Z}$ is identically distributed as $\sqrt{t+\epsilon}\,\mathbf{Z}$, and the $(\sqrt{t}\,\tilde{Z}_j + \sqrt{\epsilon}\,Z)_j$ are identically distributed as
$\sqrt{t+\epsilon}\,Z$. Therefore, applying (52) gives

$$I(AX + \sqrt{t+\epsilon}\,\mathbf{Z}; \mathbf{Z}) - I(AX + \sqrt{t}\,\mathbf{Z}; \mathbf{Z}) \le \sum_{i,j} a_{i,j}^2 \bigl( I(X_j + \sqrt{t+\epsilon}\,Z; Z) - I(X_j + \sqrt{t}\,Z; Z) \bigr) + o(\epsilon)$$

which shows that

$$f(t) = I(AX + \sqrt{t}\,\mathbf{Z}; \mathbf{Z}) - \sum_{i,j} a_{i,j}^2\, I(X_j + \sqrt{t}\,Z; Z)$$

is nonincreasing in $t > 0$. Also, by Lemma 2, $\lim_{t \to 0} f(t) = f(0) = 0$. Therefore, $f(1) \le f(0) = 0$, which proves
the required MII (61).

Finally, we show that (61) implies the ZF-EPI (56c). By the identity $I(X + Z; Z) = I(X; X + Z) + H(Z) - H(X)$
for any $X$ independent of $Z$, the MII in the form $f(t) \le 0$ can be rewritten as

$$H(AX) - \sum_{i,j} a_{i,j}^2\, H(X_j) \ge I(AX; AX + \sqrt{t}\,\mathbf{Z}) - \sum_{i,j} a_{i,j}^2\, I(X_j; X_j + \sqrt{t}\,Z) + \Delta$$

where $\Delta = H(\sqrt{t}\,\mathbf{Z}) - \sum_{i,j} a_{i,j}^2\, H(\sqrt{t}\,Z) = rH(\sqrt{t}\,Z) - rH(\sqrt{t}\,Z) = 0$, since $\sum_{i,j} a_{i,j}^2 = \mathrm{tr}(AA^t) = r$. The other
terms in the right-hand side of this inequality are of the form $I(X; X + \sqrt{t}\,Z)$, which tends to zero as $t \to \infty$ by
Lemma 2. This completes the proof. □
Notice that the approach presented here for proving the ZF-EPI is the same as for proving the original EPI,
namely, that the difference between both sides of the ZF-EPI (56c) is decreased as independent white Gaussian
noise is added:

$$H(AX) - \sum_{i,j} a_{i,j}^2\, H(X_j) \ge H(AX') - \sum_{i,j} a_{i,j}^2\, H(X'_j) \tag{63}$$

where $X' = X + Z$ and $Z$ is white Gaussian independent of $X$.

Zamir and Feder derived their results for random variables $X_j$. However, our approach can be readily extended to
random $n$-vectors. For this purpose, consider the random vector $X = (X_j)_j$ whose components $X_j$ are themselves
$n$-vectors, and adopt the convention that the components of $Y = AX$ are $n$-vectors given by the relations
$Y_i = \sum_j a_{i,j} X_j$, which amounts to saying that $A$ is a block matrix with submatrix entries $(a_{i,j} I)_{i,j}$. The generalization
of Theorem 2 is straightforward and we omit the details. The corresponding general ZF-EPI is still given by (56),
with the above convention in the notations.

V. Takano and Johnson's EPI for Dependent Variables 

A. Background 

Takano [41] and Johnson [42] provided conditions under which the EPI, in the form $N(X_1 + X_2) \ge N(X_1) +
N(X_2)$, would still hold for dependent variables. These conditions are expressed in terms of appropriately perturbed
variables

$$X_{i,t} = X_i + \sqrt{f_i(t)}\, Z_i \qquad (i = 1, 2) \tag{64}$$

where $Z_1, Z_2$ are standard Gaussian, independent of $X = (X_1, X_2)^t$ and of each other, and $f_1(t)$ and $f_2(t)$ are
positive functions which tend to infinity as $t \to \infty$. They involve individual scores $S(X_{1,t})$, $S(X_{2,t})$ and Fisher
informations $J(X_{1,t})$, $J(X_{2,t})$, as well as the entries of the joint score $S(X_t) = (S_1(X_t), S_2(X_t))^t$ and the Fisher
information matrix

$$\mathbf{J}(X_t) = \begin{pmatrix} J_{1,1}(X_t) & J_{1,2}(X_t) \\ J_{1,2}(X_t) & J_{2,2}(X_t) \end{pmatrix}$$

where $X_t = (X_{1,t}, X_{2,t})^t$. Takano's condition is [41]

$$\frac{2\,E\bigl(S(X_{1,t})S(X_{2,t})\bigr)}{J(X_{1,t})J(X_{2,t})} \ge E\left\{ \left( \frac{S_1(X_t) - S(X_{1,t})}{J(X_{1,t})} + \frac{S_2(X_t) - S(X_{2,t})}{J(X_{2,t})} \right)^{2} \right\} \tag{65}$$

for all $t > 0$. Johnson's improvement is given by the following weaker condition [42]:

$$\frac{2\,E\bigl(S(X_{1,t})S(X_{2,t})\bigr)}{J(X_{1,t})J(X_{2,t})} \ge E\left\{ \left( \frac{(J_{2,2}(X_t) - J_{1,2}(X_t))S_1(X_t) + (J_{1,1}(X_t) - J_{1,2}(X_t))S_2(X_t)}{J_{1,1}(X_t)J_{2,2}(X_t) - J_{1,2}^2(X_t)} - \frac{S(X_{1,t})}{J(X_{1,t})} - \frac{S(X_{2,t})}{J(X_{2,t})} \right)^{2} \right\} \tag{66}$$

for all $t > 0$. These conditions were found by generalizing the conventional approach presented in Section II, in
particular Blachman's representation of the score (Section II-C.2). They are simplified below. The EPI for dependent
variables finds its application in entropy-based blind source separation of dependent components (see e.g., [94]).

B. A Generalized EPI for Dependent Random Vectors 

In this section, we extend Theorem 1 to provide a simple condition on dependent random $n$-vectors under
which not only the original EPI $N(\sum_i X_i) \ge \sum_i N(X_i)$ holds, but also the EPI (5) for any choice of coefficients
$(a_i)_i$. Such a stronger form should be more relevant in applications such as blind separation of dependent components,
for it ensures that negentropy $-H$ still satisfies the requirements (7) for a contrast objective function, for any type
of linear mixture. Define

$$X_{i,t} = X_i + \sqrt{t}\, Z_i \tag{67}$$

corresponding to (64) with $f_i(t) = t$ for all $i$. Our condition will be expressed in terms of symmetric mutual
information $I\{(X_i)_i\} \ge 0$ defined by (47), which serves as a measure of dependence between the components of
a random vector.

Theorem 3: Let $X = (X_i)_i$ be any finite set of (dependent) random $n$-vectors, let $X_t = (X_{i,t})_i$ be defined
by (67), and let $Z$ be a white Gaussian random $n$-vector independent of all other random vectors. If, for any $t > 0$
and any real-valued coefficients $(a_i)_i$, adding a small perturbation $a_i Z$ to the $X_{i,t}$ makes them "more dependent"
in the sense that

$$I\{(X_{i,t} + a_i Z)_i\} \ge I\{(X_{i,t})_i\} + o(\sigma_Z^2) \tag{68}$$

then the MII (45) and the EPI (5) hold for these random vectors $(X_i)_i$.

Proof: The only place where the independence of the $(X_i)_i$ is used in the proof of Theorem 1 is Sato's
inequality (50c), which is used to the first order of $\sigma_Z^2$ and applied to random vectors of the form (67) for all $t > 0$.
Therefore it is sufficient that

$$I((X_{i,t} + a_i Z)_i; Z) \le \sum_i I(X_{i,t} + a_i Z; Z) + o(\sigma_Z^2)$$

holds for all $t > 0$ and any choice of $(a_i)_i$ to prove the MII and hence the EPI. Now from the proof of Lemma 1,
the difference between both sides of this inequality is

$$I((X_{i,t} + a_i Z)_i; Z) - \sum_i I(X_{i,t} + a_i Z; Z) = I\{(X_{i,t})_i\} - I\{(X_{i,t} + a_i Z)_i\}.$$

The result follows at once. □

Note that it is possible to check (68) for a fixed choice of the coefficients $(a_i)_i$ to ensure that the EPI (5c) holds
for these coefficients. Of course, (68) is obviously always satisfied for independent random vectors $(X_i)_i$. In order
to relate condition (68) to Takano and Johnson's (65), (66), we rewrite the former in terms of Fisher information
as follows.

Corollary 2: For random variables $(X_i)_i$ ($n = 1$), condition (68) is equivalent to

$$\mathrm{diag}(J(X_{i,t}))_i \ge \mathbf{J}(X_t) \tag{69}$$

for all $t > 0$, where $X_t = (X_{i,t})_i$. Therefore, if this condition is satisfied then the MII (45) and the EPI (5) hold.

Proof: Let $Z$ be a standard Gaussian random variable independent of $X_t$, and define $a = (a_i)_i$ and
$Y_t = X_t + \sqrt{\epsilon}\, aZ$, where $aZ = (a_i Z)_i$. The perturbations (67) ensure that the density of $X_t$ is smooth, so that the
function $I(\epsilon) = I\{(Y_{i,t})_i\}$ is differentiable for all $\epsilon \ge 0$. Now condition (68) is equivalent to the inequality
$I(\epsilon) \ge I(0) + o(\epsilon)$, that is, $I'(0) \ge 0$. By definition (47),

$$I\{(Y_{i,t})_i\} = \sum_i H(Y_{i,t}) - H(Y_t),$$

so the inequality $I'(0) \ge 0$ can be rewritten as

$$\sum_i \frac{d}{d\epsilon} H(X_{i,t} + \sqrt{\epsilon}\, a_i Z)\Big|_{\epsilon = 0} \ge \frac{d}{d\epsilon} H(X_t + \sqrt{\epsilon}\, aZ)\Big|_{\epsilon = 0}.$$

By de Bruijn's identity (33), this is equivalent to

$$\sum_i a_i^2\, J(X_{i,t}) \ge \mathrm{tr}\bigl(\mathbf{J}(X_t)\,\mathrm{Cov}(aZ)\bigr)$$

where $\mathrm{Cov}(aZ) = a\,\mathrm{Var}(Z)\,a^t = aa^t$, that is,

$$a^t \cdot \mathrm{diag}(J(X_{i,t}))_i \cdot a \ge a^t \cdot \mathbf{J}(X_t) \cdot a \tag{70}$$

for any vector $a$ and $t > 0$. This shows that (68) is equivalent to the matrix inequality (69) as required. □



We now recover Takano and Johnson's conditions (65), (66) from (69).

Lemma 3: In the case of two random variables $X_1, X_2$, conditions (65) and (66) are equivalent to

$$\lambda^t \cdot \mathrm{diag}(J(X_{i,t}))_i \cdot \lambda \ge \lambda^t \cdot \mathbf{J}(X_t) \cdot \lambda \tag{71}$$
$$\lambda^t \cdot \mathrm{diag}(J(X_{i,t}))_i \cdot \lambda \ge \mu^t \cdot \mathbf{J}(X_t) \cdot \mu, \tag{72}$$

respectively, where $\lambda$ and $\mu$ minimize the quadratic forms $a^t \cdot \mathrm{diag}(J(X_{i,t}))_i \cdot a$ and $a^t \cdot \mathbf{J}(X_t) \cdot a$, respectively,
over all vectors $a$ of the form $a = (\alpha, 1 - \alpha)^t$, $0 \le \alpha \le 1$.

Proof: Given a positive definite symmetric matrix $\mathbf{J}$, the general solution $a^*$ to

$$\min_a\; a^t \cdot \mathbf{J} \cdot a \qquad \text{subject to } \sum_i a_i = 1$$

is easily found by the Lagrangian multiplier method. One finds $a^*_i = \sum_j J^{i,j} / \sum_{i,j} J^{i,j}$ for all $i$ and
$a^{*t} \cdot \mathbf{J} \cdot a^* = 1 / \sum_{i,j} J^{i,j}$, where $J^{i,j}$ are the entries of the inverse matrix $\mathbf{J}^{-1}$. Particularizing this gives
$\lambda_1 \propto J^{-1}(X_{1,t})$, $\lambda_2 \propto J^{-1}(X_{2,t})$ and $\mu_1 \propto J_{2,2}(X_t) - J_{1,2}(X_t)$, $\mu_2 \propto J_{1,1}(X_t) - J_{1,2}(X_t)$ up to appropriate
proportionality factors, and (71), (72) are rewritten as

$$\frac{1}{J(X_{1,t})} + \frac{1}{J(X_{2,t})} \ge \frac{J_{1,1}(X_t)}{J^2(X_{1,t})} + \frac{J_{2,2}(X_t)}{J^2(X_{2,t})} + \frac{2 J_{1,2}(X_t)}{J(X_{1,t})J(X_{2,t})} \tag{73}$$

$$\left( \frac{1}{J(X_{1,t})} + \frac{1}{J(X_{2,t})} \right)^{-1} \ge \frac{J_{1,1}(X_t)J_{2,2}(X_t) - J_{1,2}^2(X_t)}{J_{1,1}(X_t) + J_{2,2}(X_t) - 2J_{1,2}(X_t)} \tag{74}$$

Meanwhile, expanding the right-hand sides in (65), (66) using Stein's identity [42] gives

$$2(v_1 + v_2) - J^{-1}(X_{1,t}) - J^{-1}(X_{2,t}) \ge v_1^2 J_{1,1}(X_t) + v_2^2 J_{2,2}(X_t) + 2v_1v_2 J_{1,2}(X_t),$$

where $(v_1, v_2) = (J^{-1}(X_{1,t}), J^{-1}(X_{2,t}))$ for Takano's condition and $(v_1, v_2) = (J_{2,2}(X_t) - J_{1,2}(X_t),\; J_{1,1}(X_t) -
J_{1,2}(X_t)) / (J_{1,1}(X_t)J_{2,2}(X_t) - J_{1,2}^2(X_t))$ for Johnson's condition. Replacing yields (73) and (74), respectively.
This proves the lemma. □
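
The constrained minimization used in this proof is easy to check numerically. The sketch below, in Python, computes the sum-to-one minimizer of a $2 \times 2$ quadratic form from the entries of the inverse matrix and compares the minimum with the closed form appearing on the right-hand side of (74). The function names and the numerical matrix are hypothetical choices made for illustration only.

```python
def sum_to_one_minimizer(j11, j12, j22):
    """Minimize a^t J a over a = (a1, a2) with a1 + a2 = 1, J symmetric positive definite.

    Returns (a_star, minimum), using a*_i = sum_j J^{ij} / sum_{ij} J^{ij}.
    """
    det = j11 * j22 - j12 ** 2
    inv = [[j22 / det, -j12 / det],
           [-j12 / det, j11 / det]]          # entries J^{ij} of the inverse matrix
    total = sum(sum(row) for row in inv)
    a_star = [sum(row) / total for row in inv]
    return a_star, 1.0 / total

def quad_form(a, j11, j12, j22):
    """Evaluate a^t J a for a 2-vector a."""
    return a[0] ** 2 * j11 + 2 * a[0] * a[1] * j12 + a[1] ** 2 * j22

# Hypothetical Fisher information matrix entries
J11, J12, J22 = 2.0, 0.5, 3.0
a_star, minimum = sum_to_one_minimizer(J11, J12, J22)

# Closed form of the minimum, as on the right-hand side of (74)
closed_form = (J11 * J22 - J12 ** 2) / (J11 + J22 - 2 * J12)

# Brute-force comparison over a = (alpha, 1 - alpha)
grid_min = min(quad_form((k / 1000.0, 1 - k / 1000.0), J11, J12, J22)
               for k in range(1001))
```

For a diagonal matrix the same routine returns weights proportional to the inverse diagonal entries, which is how $\lambda$ arises in (71).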
Corollary 3: In the case of two random variables $X_1, X_2$, condition (69) implies both Takano and Johnson's
conditions (65), (66).

Proof: Condition (69) implies (70) for any $a$ of the form $a = (\alpha, 1 - \alpha)^t$, $0 \le \alpha \le 1$. Setting $a = \lambda$ yields
Takano's condition (71). Replacing the right-hand side of the resulting inequality by the minimum over $a$ (achieved
by $a = \mu$) gives Johnson's condition (72). □
Thus, our condition (69) is stronger than Takano's or Johnson's. This is not surprising since it yields a stronger
form of the EPI (5), valid for any choice of coefficients $(a_i)_i$.

VI. Liu and Viswanath's Covariance-Constrained EPI

As already mentioned in the introduction, all known applications of the EPI to source and channel coding problems
[14]-[26] involve an inequality of the form $N(X + Z) \ge N(X) + N(Z)$, where $Z$ is Gaussian independent of
$X$. In this and the next section, we study generalizations of this inequality. We begin with Liu and Viswanath's
generalized EPI for constrained covariance matrices.

A. Background

Recently, Liu and Viswanath [43], [44] have suggested that the EPI's main contribution to multiterminal coding
problems is for solving optimization problems of the form

$$\max_{p(x)}\; H(X) - \mu H(X + Z) \qquad (\mu > 1) \tag{75}$$

where $Z$ is Gaussian and the maximization is over all random $n$-vectors $X$ independent of $Z$. The solution is easily
determined from the EPI in the form (5c) applied to the random vectors $X_1 = \mu^{1/2} X$ and $X_2 = (1 - \mu^{-1})^{-1/2} Z$:

$$H(X + Z) \ge \mu^{-1}\bigl(H(X) + \tfrac{n}{2}\log\mu\bigr) + (1 - \mu^{-1})\, H\bigl((1 - \mu^{-1})^{-1/2} Z\bigr).$$

Since equality holds iff $X_1$ and $X_2$ are Gaussian with identical covariances, it follows that the optimal solution $X$
to (75) is Gaussian with covariance matrix $\mathrm{Cov}(X) = (\mu - 1)^{-1}\mathrm{Cov}(Z)$.

Clearly, the existence of a Gaussian solution to (75) is equivalent to the EPI for two independent random vectors
$X$ and $Z$. Liu and Viswanath [43], [44] have found an implicit generalization of the EPI by showing that (75)
still admits a Gaussian solution under the covariance constraint $\mathrm{Cov}(X) \le C$, where $C$ is any positive definite
matrix. They gave a "direct proof," motivated by the vector Gaussian broadcast channel problem, using the classical
EPI, the saddlepoint property of mutual information (42) and the "enhancement" technique for Gaussian random
vectors introduced by Weingarten, Steinberg and Shamai [15]. They also gave a "perturbation proof" using a
generalization of the conventional techniques presented in Section II, namely, an integration over a path of the form
$\{\sqrt{1-t}\,X + \sqrt{t}\,Z\}$ of a generalized FII (14c) with matrix coefficients, using de Bruijn's identity and the
Cramér-Rao inequality$^{13}$. This and similar results for various optimization problems involving several Gaussian random
vectors find applications in vector Gaussian broadcast channels and distributed vector Gaussian source coding [44].
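
To illustrate the unconstrained problem (75) in the simplest setting, the following Python sketch restricts $X$ to scalar Gaussians ($n = 1$) and checks that the variance $(\mu - 1)^{-1}\mathrm{Var}(Z)$ maximizes $H(X) - \mu H(X + Z)$ over a grid of candidate variances. The function names, the values of $\mu$ and $\mathrm{Var}(Z)$, and the grid are assumptions chosen for illustration; the sketch says nothing about non-Gaussian $X$, which is precisely what the EPI handles.

```python
import math

def objective(var_x, var_z, mu):
    """H(X) - mu * H(X + Z) in nats for independent scalar Gaussians X and Z."""
    h_x = 0.5 * math.log(2 * math.pi * math.e * var_x)
    h_sum = 0.5 * math.log(2 * math.pi * math.e * (var_x + var_z))
    return h_x - mu * h_sum

def optimal_variance(var_z, mu):
    """Variance of the Gaussian maximizer of (75) in the scalar case (mu > 1)."""
    return var_z / (mu - 1.0)

# Hypothetical parameters
VAR_Z, MU = 2.0, 3.0
v_star = optimal_variance(VAR_Z, MU)          # equals 1.0 for these parameters
best = objective(v_star, VAR_Z, MU)

# Compare against a grid of other Gaussian variances
grid = [0.01 * k for k in range(1, 1001)]
grid_best = max(objective(v, VAR_Z, MU) for v in grid)
```

The grid maximum never exceeds the value at `v_star`, in agreement with the stationarity condition $1/v = \mu/(v + \mathrm{Var}(Z))$ obtained by differentiating the Gaussian objective.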

B. An Explicit Covariance-Constrained MII

We first give explicit forms of covariance-constrained MII and EPI, which will be used to solve Liu and
Viswanath's optimization problem. Again, the same ideas as in the proof of Theorem 1 are easily generalized
to prove the following covariance-constrained MII and EPI, using only basic properties of mutual information.

Theorem 4: Let $X_1, X_2$ be independent random $n$-vectors with positive definite covariance matrices, and let
$Z_1, Z_2$ be Gaussian random $n$-vectors independent of $X_1, X_2$ and of each other, with covariances proportional to
those of $X_1$ and $X_2$, respectively: $\mathrm{Cov}(Z_1) = \alpha\,\mathrm{Cov}(X_1)$, $\mathrm{Cov}(Z_2) = \alpha\,\mathrm{Cov}(X_2)$, where $\alpha > 0$. Assume that $X_2$
is Gaussian and $X_1, X_2$ are subject to the covariance constraint

$$\mathrm{Cov}(X_1) \le \mathrm{Cov}(X_2). \tag{76}$$

Then for any real-valued coefficients $a_1, a_2$ normalized such that $a_1^2 + a_2^2 = 1$,

$$I(a_1X_1 + a_2X_2 + Z; Z) \le a_1^2\, I(X_1 + Z_1; Z_1) + a_2^2\, I(X_2 + Z_2; Z_2) \tag{77}$$

where we have noted $Z = a_1Z_1 + a_2Z_2$. Furthermore, this inequality implies the following generalized EPI:

$$H(a_1X_1 + a_2X_2) \ge a_1^2 H(X_1) + a_2^2 H(X_2) + \Delta \tag{78}$$

where

$$\Delta = H(Z) - a_1^2 H(Z_1) - a_2^2 H(Z_2) \ge 0. \tag{79}$$

Note that for the particular case $\mathrm{Cov}(X_1) = \mathrm{Cov}(X_2)$, we have $\mathrm{Cov}(Z_1) = \mathrm{Cov}(Z_2)$, the random vectors $Z_1, Z_2$
and $Z$ are identically distributed, $\Delta = 0$ and Theorem 4 reduces to Theorem 1 for two random vectors.
Proof of Theorem 4: First define

$$Z'_i = \mathrm{Cov}(Z_i)\,\mathrm{Cov}(Z)^{-1} Z \qquad (i = 1, 2)$$

with covariance matrices $\mathrm{Cov}(Z'_i) = \mathrm{Cov}(Z_i)\mathrm{Cov}(Z)^{-1}\mathrm{Cov}(Z_i)$. From (76), one successively has $\mathrm{Cov}(Z_1) \le
\mathrm{Cov}(Z_2)$, $\mathrm{Cov}(Z_1) \le a_1^2\mathrm{Cov}(Z_1) + a_2^2\mathrm{Cov}(Z_2) = \mathrm{Cov}(Z)$, $\mathrm{Cov}(Z)^{-1} \le \mathrm{Cov}^{-1}(Z_1)$, and upon left and right
multiplication by $\mathrm{Cov}(Z_1)$, $\mathrm{Cov}(Z'_1) = \mathrm{Cov}(Z_1)\mathrm{Cov}(Z)^{-1}\mathrm{Cov}(Z_1) \le \mathrm{Cov}(Z_1)$. Similarly, $\mathrm{Cov}(Z'_2) \ge \mathrm{Cov}(Z_2)$.

$^{13}$ As explained in Section II-D.7, the Cramér-Rao inequality (43) is equivalent to the saddlepoint property (42) used in their "direct proof".

Therefore, we can write

$$\begin{aligned} Z_1 &= Z'_1 + \tilde{Z}_1 \\ Z'_2 &= Z_2 + \tilde{Z}_2 \end{aligned} \tag{80}$$

where $\tilde{Z}_1$ and $\tilde{Z}_2$ are Gaussian and independent of $Z'_1$ and $Z_2$, respectively. We can now write the following string
of inequalities:

$$\begin{aligned}
I(a_1X_1 + a_2X_2 + Z; Z) &= I(a_1(X_1 + a_1Z'_1) + a_2(X_2 + a_2Z'_2); Z) && \text{(81a)} \\
&\le I(X_1 + a_1Z'_1,\, X_2 + a_2Z'_2; Z) && \text{(81b)} \\
&\le I(X_1 + a_1Z'_1; Z'_1) + I(X_2 + a_2Z'_2; Z'_2) && \text{(81c)} \\
&= I(X_1 + a_1Z_1; Z_1) + I(X_2 + a_2Z_2; Z_2) \\
&\quad - I(X_1 + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1) + I(X_2 + a_2(Z_2 + \tilde{Z}_2); \tilde{Z}_2) && \text{(81d)}
\end{aligned}$$

where (81a) holds since $a_1^2Z'_1 + a_2^2Z'_2 = \mathrm{Cov}(Z)\mathrm{Cov}(Z)^{-1}Z = Z$, (81b) follows from the data processing theorem
applied to the linear transformation (6), (81c) follows from Sato's inequality (Lemma 1), and (81d) follows by
applying the identity (52) to the random vectors defined by (80). By Proposition 7, $I(X_1 + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1) \ge
I(X_1^* + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1)$, where $X_1^*$ is Gaussian with covariance matrix $\mathrm{Cov}(X_1^*) = \mathrm{Cov}(X_1)$.

We now use the assumption that $\mathrm{Cov}(Z_1) = \alpha\,\mathrm{Cov}(X_1)$, $\mathrm{Cov}(Z_2) = \alpha\,\mathrm{Cov}(X_2)$ and let $\alpha \to 0$ in the well-known
expressions for mutual informations of Gaussian random vectors:

$$I(X_1^* + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1) = \tfrac{1}{2}\log\bigl|I + \alpha a_1^2\, \mathrm{Cov}(\tilde{Z}_1)\mathrm{Cov}(Z_1)^{-1}\bigr| + o(\alpha)$$

$$I(X_2 + a_2(Z_2 + \tilde{Z}_2); \tilde{Z}_2) = \tfrac{1}{2}\log\bigl|I + \alpha a_2^2\, \mathrm{Cov}(\tilde{Z}_2)\mathrm{Cov}(Z_2)^{-1}\bigr| + o(\alpha)$$

where $\mathrm{Cov}(\tilde{Z}_1) = \mathrm{Cov}(Z_1) - \mathrm{Cov}(Z'_1) = \mathrm{Cov}(Z_1) - \mathrm{Cov}(Z_1)\mathrm{Cov}(Z)^{-1}\mathrm{Cov}(Z_1)$ and $\mathrm{Cov}(\tilde{Z}_2) = \mathrm{Cov}(Z'_2) -
\mathrm{Cov}(Z_2) = \mathrm{Cov}(Z_2)\mathrm{Cov}(Z)^{-1}\mathrm{Cov}(Z_2) - \mathrm{Cov}(Z_2)$. Since $\mathrm{Cov}(Z) = a_1^2\mathrm{Cov}(Z_1) + a_2^2\mathrm{Cov}(Z_2)$ and $a_1^2 + a_2^2 = 1$, we
have $a_1^2\,\mathrm{Cov}(\tilde{Z}_1)\mathrm{Cov}(Z_1)^{-1} = a_1^2(I - \mathrm{Cov}(Z_1)\mathrm{Cov}(Z)^{-1}) = a_2^2(\mathrm{Cov}(Z_2)\mathrm{Cov}(Z)^{-1} - I) = a_2^2\,\mathrm{Cov}(\tilde{Z}_2)\mathrm{Cov}(Z_2)^{-1}$,
and therefore,

$$\begin{aligned}
I(X_1 + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1) &\ge I(X_1^* + a_1(Z'_1 + \tilde{Z}_1); \tilde{Z}_1) \\
&= I(X_2 + a_2(Z_2 + \tilde{Z}_2); \tilde{Z}_2) + o(\alpha).
\end{aligned}$$

It follows from (81d) that

$$I(a_1X_1 + a_2X_2 + Z; Z) \le I(X_1 + a_1Z_1; Z_1) + I(X_2 + a_2Z_2; Z_2) + o(\alpha) \tag{83}$$

The rest of the proof is entirely similar to that of Theorem 1. Here is a sketch. Write (83) with $X_1$ and $X_2$ replaced
by $X_1 + \sqrt{t}\,\hat{Z}_1$ and $X_2 + \sqrt{t}\,\hat{Z}_2$, where $\hat{Z}_i$ is identically distributed as $Z_i$ and independent of all other random
vectors, for $i = 1, 2$. Applying Lemma 2 to the right-hand side of the resulting inequality gives

$$I(a_1X_1 + a_2X_2 + \sqrt{t}\,\hat{Z} + \sqrt{\epsilon}\,Z; Z) \le a_1^2\, I(X_1 + \sqrt{t}\,\hat{Z}_1 + \sqrt{\epsilon}\,Z_1; Z_1) + a_2^2\, I(X_2 + \sqrt{t}\,\hat{Z}_2 + \sqrt{\epsilon}\,Z_2; Z_2) + o(\epsilon)$$

where $\hat{Z}$ is identically distributed as $Z$. By virtue of (52), this can be written in the form $f(t + \epsilon) \le f(t) + o(\epsilon)$,
where

$$f(t) = I(a_1X_1 + a_2X_2 + \sqrt{t}\,Z; Z) - a_1^2\, I(X_1 + \sqrt{t}\,Z_1; Z_1) - a_2^2\, I(X_2 + \sqrt{t}\,Z_2; Z_2).$$

Therefore, $f(t)$ is nonincreasing, and $f(1) \le f(0) = 0$, which is the required MII (77). This in turn can be rewritten
in the form

$$H(a_1X_1 + a_2X_2) - a_1^2 H(X_1) - a_2^2 H(X_2) \ge \Delta + \Delta'$$

where $\Delta$ is defined by (79) and $\Delta' = I(a_1X_1 + a_2X_2; a_1X_1 + a_2X_2 + Z) - a_1^2\, I(X_1; X_1 + Z_1) - a_2^2\, I(X_2; X_2 + Z_2)$
tends to zero as $\alpha \to \infty$ by Lemma 2. This proves (78) and the theorem. □
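
The covariance bookkeeping in this proof can be verified in the scalar case ($n = 1$), where all covariance matrices are positive numbers. The Python sketch below computes the variances of the auxiliary Gaussian variables built from $\mathrm{Var}(Z_1) \le \mathrm{Var}(Z_2)$ and checks that both residual variances are nonnegative and that the two quantities matched in the proof coincide. The function name and the numerical values are hypothetical, chosen only to illustrate the identities.

```python
def theorem4_scalars(c1, c2, a1_sq):
    """Scalar version of the covariance identities in the proof of Theorem 4.

    c1, c2 : variances of Z_1 and Z_2, with c1 <= c2 as in (76)
    a1_sq  : a_1^2, with a_2^2 = 1 - a_1^2
    Returns (var_Z, var_tilde1, var_tilde2, lhs, rhs), where lhs and rhs are the
    two expressions a_i^2 Var(tilde Z_i) / Var(Z_i) shown to be equal in the proof.
    """
    a2_sq = 1.0 - a1_sq
    var_z = a1_sq * c1 + a2_sq * c2      # Var(Z) with Z = a_1 Z_1 + a_2 Z_2
    var_z1p = c1 * c1 / var_z            # Var(Z'_1)
    var_z2p = c2 * c2 / var_z            # Var(Z'_2)
    var_tilde1 = c1 - var_z1p            # from Z_1 = Z'_1 + tilde Z_1
    var_tilde2 = var_z2p - c2            # from Z'_2 = Z_2 + tilde Z_2
    lhs = a1_sq * var_tilde1 / c1
    rhs = a2_sq * var_tilde2 / c2
    return var_z, var_tilde1, var_tilde2, lhs, rhs

# Hypothetical values
var_z, t1, t2, lhs, rhs = theorem4_scalars(1.0, 3.0, 0.25)
```

When `c1 == c2` both residual variances are zero, which corresponds to the case where Theorem 4 reduces to Theorem 1.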
It is now easy to recover Liu and Viswanath's formulation.

Corollary 4 (Liu and Viswanath [43], [44]): The maximization problem (75), subject to the covariance
constraint $\mathrm{Cov}(X) \le C$, admits a Gaussian optimal solution $X^*$.

Proof: Let $X^*$ be the optimal solution to the maximization problem obtained by restricting the solution space
within Gaussian distributions. Thus $\mathrm{Cov}(X^*) > 0$ maximizes

$$\tfrac{1}{2}\log\bigl((2\pi e)^n |\mathrm{Cov}(X)|\bigr) - \tfrac{\mu}{2}\log\bigl((2\pi e)^n |\mathrm{Cov}(X) + \mathrm{Cov}(Z)|\bigr)$$

over all covariance matrices $\mathrm{Cov}(X) \le C$. As stated in [44] and shown in [15], $\mathrm{Cov}(X^*)$ must satisfy the
Karush-Kuhn-Tucker condition

$$\tfrac{1}{2}\mathrm{Cov}(X^*)^{-1} = \tfrac{\mu}{2}\bigl(\mathrm{Cov}(X^*) + \mathrm{Cov}(Z)\bigr)^{-1} + M,$$

where $M \ge 0$ is a Lagrange multiplier corresponding to the constraint $\mathrm{Cov}(X) \le C$. It follows that $\mathrm{Cov}(X^*)^{-1} \ge
\mu(\mathrm{Cov}(X^*) + \mathrm{Cov}(Z))^{-1}$, that is, $\mu\,\mathrm{Cov}(X^*) \le \mathrm{Cov}(X^*) + \mathrm{Cov}(Z)$, or $\mathrm{Cov}(\mu^{1/2}X^*) \le \mathrm{Cov}((1 - \mu^{-1})^{-1/2}Z)$.

Now let $X$ be any random vector independent of $Z$, such that $\mathrm{Cov}(X) = \mathrm{Cov}(X^*)$. Define $a_1 = \mu^{-1/2}$,
$a_2 = (1 - \mu^{-1})^{1/2}$, $X_1 = \mu^{1/2}X$, $X_2 = (1 - \mu^{-1})^{-1/2}Z$ and $Z_1 = \mu^{1/2}X^*$, and let $Z_2$ be a Gaussian random
vector identically distributed as $X_2$ and independent of $Z_1$. Since $a_1^2 + a_2^2 = 1$ and $\mathrm{Cov}(X_1) = \mathrm{Cov}(Z_1) \le
\mathrm{Cov}(X_2) = \mathrm{Cov}(Z_2)$, we may apply Theorem 4. By (78), we obtain

$$H(a_1X_1 + a_2X_2) - a_1^2 H(X_1) \ge H(a_1Z_1 + a_2Z_2) - a_1^2 H(Z_1),$$

that is, replacing and rearranging,

$$H(X) - \mu H(X + Z) \le H(X^*) - \mu H(X^* + Z).$$

Therefore, the Gaussian random vector $X^*$ is an optimal solution to (75) subject to the constraint $\mathrm{Cov}(X) \le C$.
This completes the proof. □

VII. Costa's EPI: Concavity of Entropy Power

A. Background 

Costa [36] has strengthened the EPI for two random vectors $X, Z$ in the case where $Z$ is white Gaussian. While
it can be easily shown [4], [36] that Shannon's EPI for $X, Z$ is equivalent to the inequality

$$\frac{d}{dt} N(X + \sqrt{t}\,Z) \ge 1,$$

Costa's EPI is the concavity inequality which expresses that the entropy power is a concave function of the power
of the added Gaussian noise:

$$\frac{d^2}{dt^2} N(X + \sqrt{t}\,Z) \le 0 \tag{84}$$

Alternatively, the concavity of the entropy power is equivalent to saying that the slope $S(t) = (N(X + \sqrt{t}\,Z) -
N(X))/t$ drawn from the origin is nonincreasing, while the corresponding Shannon's EPI is weaker, being simply
equivalent to the inequality $S(1) \ge S(\infty) = N(Z)$.

The original proof of Costa through an explicit calculation of the second derivative in (84) is quite involved [36].
His calculations are simplified in [95]. Dembo gave an elegant proof using the FII over the path $\{X + \sqrt{t}\,Z\}$ [4],
[37]. Recently, Guo, Shamai, and Verdú provided a clever proof using the MMSE over the path $\{\sqrt{t}\,X + Z\}$ [12].

Costa's EPI has been used to determine the capacity region of the Gaussian interference channel [20]. It was
also used as a continuity argument about entropy that was required for the analysis of the capacity of flat-fading
channels in [96].

B. A New Proof of the Concavity of the Entropy Power 

In his original presentation [36], Costa proposed the concavity property $N(X + \sqrt{t}\,Z) \ge (1 - t)N(X) + tN(X + Z)$
in the segment (0, 1) for white Gaussian $Z$, in which case he showed its equivalence to (84). He also established
this inequality in the dual case where $X$ is Gaussian and $Z$ is arbitrary. In the latter case, however, this inequality
is not sufficient to prove that $N(X + \sqrt{t}\,Z)$ is a concave function of $t > 0$. In this section, we prove a slight
generalization of Costa's EPI, showing concavity in both cases, for an arbitrary (not necessarily white) Gaussian
random vector. Again the proposed proof relies only on the basic properties of mutual information.

Theorem 5 (Concavity of Entropy Power): Let $X$ and $Z$ be any two independent random $n$-vectors. If either $X$
or $Z$ is Gaussian, then $N(X + \sqrt{t}\,Z)$ is a concave function of $t$.

Proof: To simplify the notation, let $Z_t = \sqrt{t}\,Z$, and similarly $X_t = \sqrt{t}\,X$. First, it is sufficient to prove concavity
in the case where $Z$ is Gaussian, because, as is easily checked, the functions $n(t) = N(X + Z_t)$ and $t \cdot n(1/t) = N(X_t + Z)$
are always simultaneously concave. Next define

$$f_X(t) = \exp\bigl(\tfrac{2}{n} I(X + Z_t; Z)\bigr) = \frac{N(X + Z_t)}{N(X)}.$$

Our aim is to prove (84), that is, $f''_X(t) \le 0$. Consider the MII (45) in the form

$$I(X_\lambda + Y_{1-\lambda} + Z_t; Z) \le \lambda I(X + Z_t; Z) + (1 - \lambda) I(Y + Z_t; Z)$$

where $Y$ is independent of $X, Z$ and $0 \le \lambda \le 1$. Replacing $X_\lambda, Y_{1-\lambda}$ by $X, Y$ gives the alternative form

$$I(X + Y + Z_t; Z) \le \lambda I(X + Z_{\lambda t}; Z) + (1 - \lambda) I(Y + Z_{(1-\lambda)t}; Z).$$

Choose $Z'$ and $Z''$ such that $Z$, $Z'$ and $Z''$ are i.i.d. and independent of $X$, and replace $X$ by $X + Z'_u$ and $Y$ by
$Z''_v$:

$$I(X + Z'_u + Z''_v + Z_t; Z) \le \lambda I(X + Z'_u + Z_{\lambda t}; Z) + (1 - \lambda) I(Z''_v + Z_{(1-\lambda)t}; Z). \tag{85}$$

We now turn this into a "mutual information power inequality" similarly as the EPI (5a) is derived from (5c) in the
proof of Proposition 1. Define $M_t(X)$ as the power of a Gaussian random vector $\tilde{X}$ having covariances proportional
to those of $Z$ and identical mutual information $I(\tilde{X} + Z_t; Z) = I(X + Z_t; Z)$. By Shannon's capacity formula,
$I(\tilde{X} + Z_t; Z) = \frac{n}{2}\log(1 + t\sigma_Z^2/\sigma_{\tilde{X}}^2)$, and therefore

$$M_t(X) = \frac{t\sigma_Z^2}{\exp\bigl(\frac{2}{n} I(X + Z_t; Z)\bigr) - 1} = \frac{t\sigma_Z^2}{f_X(t) - 1}.$$

Choose $\lambda \in [0, 1]$ such that $I(X + Z'_u + Z_{\lambda t}; Z) = I(Z''_v + Z_{(1-\lambda)t}; Z)$ in (85). This is always possible, because
the difference has opposite signs for $\lambda = 0$ and $\lambda = 1$. By applying the function $t\sigma_Z^2 / (\exp(\frac{2}{n}\,\cdot\,) - 1)$ to both sides
of (85), we find the inequality

$$M_t(X + Z'_u + Z''_v) \ge M_{\lambda t}(X + Z'_u) + v\sigma_Z^2.$$

We now let $t \to 0$ (so that $\lambda t \to 0$). Since $f_{X + Z'_u}(t) = f_X(t + u)/f_X(u)$, and similarly, $f_{X + Z'_u + Z''_v}(t) =
f_X(t + u + v)/f_X(u + v)$, we obtain

$$\frac{f_X(u + v)}{f'_X(u + v)} \ge \frac{f_X(u)}{f'_X(u)} + v.$$

Dividing by $v$ and letting $v \to 0$ gives

$$\frac{d}{du}\,\frac{f_X(u)}{f'_X(u)} \ge 1,$$

that is, carrying out the derivation, $f_X(u)f''_X(u)/f'_X(u)^2 \le 0$ or $f''_X(u) \le 0$ as required. □
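
The "mutual information power" used in this proof can be illustrated in the one case where everything is explicit: scalar Gaussian $X$ ($n = 1$). The Python sketch below evaluates $I(X + Z_t; Z)$ from Shannon's capacity formula and recovers $M_t(X)$, which should then equal $\mathrm{Var}(X)$ for every $t$, so that the power inequality obtained from (85) holds with equality. The function names and the numerical values are illustrative assumptions.

```python
import math

def mi_gaussian(var_x, var_z, t):
    """I(X + sqrt(t) Z; Z) in nats for independent scalar Gaussians (capacity formula)."""
    return 0.5 * math.log(1.0 + t * var_z / var_x)

def mi_power(mutual_info, var_z, t, n=1):
    """M_t: power of the Gaussian vector with the same mutual information I(. + Z_t; Z)."""
    return t * var_z / (math.exp(2.0 * mutual_info / n) - 1.0)

# Hypothetical values
VAR_X, VAR_NOISE = 2.0, 0.5
powers = [mi_power(mi_gaussian(VAR_X, VAR_NOISE, t), VAR_NOISE, t)
          for t in (0.1, 1.0, 10.0)]

# Gaussian case of the power inequality: X + Z'_u + Z''_v has variance
# VAR_X + (u + v) * VAR_NOISE, and the inequality holds with equality.
u, v, t = 0.7, 0.3, 1.0
lhs = mi_power(mi_gaussian(VAR_X + (u + v) * VAR_NOISE, VAR_NOISE, t), VAR_NOISE, t)
rhs = (VAR_X + u * VAR_NOISE) + v * VAR_NOISE
```

For non-Gaussian $X$ no closed form is available in general, so this sketch only confirms that the definitions are consistent in the equality case.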

VIII. Open Questions 

A. FII and MII for Subsets of Independent Variables

Recently, Artstein, Ball, Barthe, and Naor [45] proved a new entropy power inequality involving entropy powers
of sums of all independent variables excluding one, which solved a long-standing conjecture about the monotonicity
of entropy. This was generalized to arbitrary collections of subsets of independent variables (or vectors) by Madiman
and Barron [46], [47]. The generalization of the classical formulation of the EPI takes the form

$$N\Bigl(\sum_i X_i\Bigr) \ge \frac{1}{k}\sum_S N\Bigl(\sum_{i \in S} X_i\Bigr) \tag{86}$$

where the sum in the right-hand side is over arbitrary subsets $S$ of indexes, and $k$ is the maximum number of
subsets in which one index appears. Note that we may always assume that subsets $S$ are "balanced" [47], i.e.,
each index $i$ appears in the right-hand side of (86) exactly $k$ times. This is because it is always possible to add
singletons to a given collection of subsets until the balancing condition is met; since the EPI (86) would hold for
the augmented collection, it a fortiori holds for the initial collection as well.

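For balanced subsets, the inequalities generalizing (5c), (14c), (30) and (45) are the following.

Proposition 9: Let $(X_i)_i$ be finitely many random $n$-vectors, let $Z$ be any Gaussian random $n$-vector independent
of $(X_i)_i$, and let $(a_i)_i$ be any real-valued coefficients normalized such that $\sum_i a_i^2 = 1$. Then, for any collection
$\{S\}$ of balanced subsets of indexes,

$$J\Bigl(\sum_i a_i X_i\Bigr) \le \sum_S a_S^2\, J(X_S), \tag{87a}$$

$$\mathrm{Var}\Bigl(\sum_i a_i X_i \Bigm| \sum_i a_i X_i + Z\Bigr) \ge \sum_S a_S^2\, \mathrm{Var}(X_S \mid X_S + Z), \tag{87b}$$

$$H\Bigl(\sum_i a_i X_i\Bigr) \ge \sum_S a_S^2\, H(X_S), \tag{87c}$$

$$I\Bigl(\sum_i a_i X_i + Z; Z\Bigr) \le \sum_S a_S^2\, I(X_S + Z; Z), \tag{87d}$$

where $a_S^2 = \frac{1}{k}\sum_{i \in S} a_i^2$ (so that $\sum_S a_S^2 = 1$) and $X_S$ is given by the covariance preserving transformation

$$X_S = \frac{\sum_{i \in S} a_i X_i}{\sqrt{\sum_{i \in S} a_i^2}}.$$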

Available proofs of (87a)-(87c) are generalizations of the conventional techniques presented in Section II, where
an additional tool ("variance drop lemma") is needed to prove either (87a) or (87b); see [45, Lemma 5], [48,
Lemma 3], or [47, Lemma 2]. Artstein, Ball, Barthe and Naor's proof of the EPI (87c), which is generalized and
simplified by Madiman and Barron, is through an integration of the FII (87a) over the path $\{\sqrt{t}\,X + \sqrt{1-t}\,Z\}$
(in [45], see (44b)) or $\{X + \sqrt{t}\,Z\}$ (in [47], see (44a)). Tulino and Verdú provided the corresponding proof via
MMSE [48], through an integration of the MMSE inequality (87b) over the path $\{\sqrt{t}\,X + Z\}$ (see (44c)). Again
the approaches corresponding to (87a) and (87b) are equivalent by virtue of the complementary relation (26), as
explained in Section II-C.3.

That the MII (87d) holds is easily shown through (87a) or (87b) and de Bruijn's identity (38) or (39). However,
the author was not able to extend the ideas in the proof of Theorem 1 to provide a direct proof of the MII (87d),
which letting $\sigma_Z^2 \to \infty$ would yield an easy proof of the generalized EPI (87c). Such an extension perhaps
involves a generalization of the data processing inequality or Sato's inequality in (50), which using the relation
$\sum_i a_i X_i = k^{-1/2}\sum_S a_S X_S$ would yield the inequality $I(\sum_i a_i X_i + Z; Z) \le \sum_S I(X_S + a_S Z; Z) + o(\sigma_Z^2)$.

B. EPI, FII and MII for Gas Mixtures

There is a striking resemblance between the original inequalities (5c), (14c), (30) and (45) for linear mixtures
of independent random vectors, and known inequalities concerning entropy and Fisher information for linear "gas
mixtures" of probability distributions.

Proposition 10: Let the random variable $I$ have distribution $p(i) = a_i^2$ where $\sum_i a_i^2 = 1$, let $(X_i)_i$ be finitely
many random $n$-vectors independent of $I$, and let $Z$ be white Gaussian, independent of $(X_i)_i$ and $I$. Then

$$J(X_I) \le \sum_i a_i^2\, J(X_i) \tag{88a}$$
$$\mathrm{Var}(X_I \mid X_I + Z) \ge \sum_i a_i^2\, \mathrm{Var}(X_i \mid X_i + Z) \tag{88b}$$
$$H(X_I) \ge \sum_i a_i^2\, H(X_i) \tag{88c}$$
$$I(X_I + Z; Z) \le \sum_i a_i^2\, I(X_i + Z; Z). \tag{88d}$$

Noting that $X_I$ has distribution $p_{X_I}(x) = \sum_i p(i)p(x \mid i) = \sum_i a_i^2\, p_{X_i}(x)$, the "FII" (88a) can be proved directly
as follows. Let $S(x) = \nabla p_{X_I}(x)/p_{X_I}(x)$ and $S_i(x) = \nabla p_{X_i}(x)/p_{X_i}(x)$ be the score functions of $X_I$ and the
$(X_i)_i$, respectively, and define $\lambda_i(x) = a_i^2\, p_{X_i}(x)/p_{X_I}(x)$ for all $i$. Then $\sum_i \lambda_i(x) = 1$, $S(x) = \sum_i \lambda_i(x)S_i(x)$
and since the squared norm is convex, $\|S(x)\|^2 \le \sum_i \lambda_i(x)\|S_i(x)\|^2 = p_{X_I}^{-1}(x)\sum_i a_i^2\, p_{X_i}(x)\|S_i(x)\|^2$. Averaging
over $p_{X_I}(x)$ gives (88a).

Once (88a) is established, the conventional techniques presented in Section II can be easily adapted to deduce
the other inequalities (88b)-(88d): Substituting $(X_i + Z_i)_i$ for $(X_i)_i$ in (88a), where the $(Z_i)_i$ are independent
copies of $Z$, and noting that $X_I + Z_I$ has the same probability distribution as $X_I + Z$, we obtain the inequality
$J(X_I + Z) \le \sum_i a_i^2\, J(X_i + Z)$; applying the complementary relation (26) gives (88b); integrating using de Bruijn's
identity (38) or (39) gives (88d), from which (88c) follows as in the proof of Theorem 1.

In the present case, however, (88c) and (88d) are already well known. In fact, since $p_{X_I}(x) = \sum_i a_i^2\, p_{X_i}(x)$ is
a convex combination of distributions, the "EPI" (88c) is nothing but the classical concavity property of entropy,
seen as a functional of the probability distribution [9], [84], [85]. This is easily established by noting that since
conditioning decreases entropy, $H(X_I) \ge H(X_I \mid I) = \sum_i p(i)H(X_i)$. Also the "MII" (88d) is just the classical
convexity of mutual information $I(Y; Z)$, seen as a functional of the distribution $p(y \mid z)$ for fixed $p(z)$ [9, Thm.
2.7.4], [84, Thm. 4.4.3], [85, Thm. 1.7].
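
The concavity property behind (88c) is easy to observe numerically. The Python sketch below uses discrete distributions as a stand-in for densities (an assumption made purely to keep the computation elementary; the paper's statement concerns differential entropy) and compares the entropy of a mixture with the corresponding average of entropies. The function names, the weights and the component distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mixture(weights, dists):
    """Distribution of X_I when I = i has probability weights[i] and X_i ~ dists[i]."""
    size = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(size)]

# Hypothetical mixing weights p(i) = a_i^2 and component distributions
WEIGHTS = [0.3, 0.7]
DISTS = [[0.9, 0.1, 0.0],
         [0.2, 0.3, 0.5]]

h_mixture = entropy(mixture(WEIGHTS, DISTS))                     # H(X_I)
h_average = sum(w * entropy(d) for w, d in zip(WEIGHTS, DISTS))  # H(X_I | I)
```

The difference `h_mixture - h_average` is the mutual information between the index and the mixture output, which is why it cannot be negative.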

Accordingly, we may reverse the order of implication and derive the corresponding convexity property of Fisher
information (88a) anew from the "MII" (88d). Indeed, (88d) can be rewritten in the form

$$H(X_I + \sqrt{t}\,Z) - H(X_I) \le \sum_i a_i^2 \bigl( H(X_i + \sqrt{t}\,Z) - H(X_i) \bigr) \qquad (t > 0).$$

Dividing both sides by $t$ and letting $t \to 0$ gives (88a) by virtue of de Bruijn's identity. This derivation is much
shorter than earlier proofs of inequality (88a) [4, Lemma 6], [97]. The convexity property of Fisher information
finds application in channel estimation [53] and thermodynamics [98].

It is perhaps worthwhile to investigate the relationship between (88) and the original EPI and FII further. In
particular, for independent random vectors $(X_i)_i$, it is plausible that the linear combination $\sum_i a_i X_i$ is "more
Gaussian" than the corresponding gas mixture $X_I$, in the sense that non-Gaussianness (3) or (12) is lower. Since,
as it is easy to check,

$$\mathrm{Cov}(X_I) = \mathrm{Cov}\Bigl(\sum_i a_i X_i\Bigr),$$

we are led to conjecture that

$$H\Bigl(\sum_i a_i X_i\Bigr) \ge H(X_I) \tag{89a}$$

$$J\Bigl(\sum_i a_i X_i\Bigr) \le J(X_I) \tag{89b}$$

which if true and combined with Proposition 10 would yield an easy proof of the original EPI and FII.